ABSTRACT


Global Combat Support System - Marine Corps is a large logistics system designed to replace
numerous legacy systems used by the Marine Corps. While it has been in existence
for a while, its intended potential has not been fully realized. Therefore, various teams are
working hard to develop the analytics that will benefit the community. With the growth
of data, the only way these analytics (in Structured Query Language [SQL]) will run efficiently
will be on proprietary hardware from Oracle. This research looks at running the
same analytics on commodity hardware using Hadoop Distributed File System and Java
MapReduce. The results show that while it takes longer to program in Java than in SQL,
the analytics are just as powerful as SQL, or even more so, and the potential to save on
hardware cost is significant.
CHAPTER 1:

Introduction 


Big data has become a buzzword in both the government and private sectors. That does
not mean that big data is not a worthwhile venture. There is no hiding from the fact that
data continues to grow, and processing enormous amounts of data using conventional
techniques can become unmanageable. For example, the Pentagon is attempting to expand its
worldwide communications network to handle yottabytes (YB) (10^24 bytes) of data (a YB
is one trillion terabytes (TB)) [1]. As the data increases to new thresholds, current database
architectures struggle to keep up. This thesis examines the Global Combat Support System
- Marine Corps (GCSS-MC) database and how to apply a big data solution.


1.1  The  Problem 

The amount of data being stored in databases is increasing in both the government and
private sectors. The United States Marine Corps (USMC) is no different, and the need to access,
store, and process large amounts of data exists. The USMC is currently working on an
enterprise resource planning (ERP) system that will replace all of its legacy logistics
systems. The solution that has been developed is called GCSS-MC. GCSS-MC is a complex
undertaking and has seen its share of problems during development. The system is making
significant achievements and is becoming a useful resource to the USMC. However, the
design of this system has taken a significant amount of time to implement while the data
continues to grow.

Initially, the GCSS-MC design did not consider a big data element in the system. But as the
amount of data and the desire to retain it increase, the USMC needs to find ways to add
a big data solution to GCSS-MC. This thesis explores a method to implement a big
data element in GCSS-MC by adding a Hadoop Distributed File System (HDFS) cluster.
Furthermore, a method is discussed to show that no hardware changes need to be
made to the existing GCSS-MC architecture.
1.2 Research Questions

The  research  and  work  of  this  thesis  will  be  focused  on  addressing  the  following  questions: 

•  What  would  an  architecture  look  like  that  adds  a  big  data  element  to  GCSS-MC? 

•  If an architecture can be developed, what modifications would the GCSS-MC
architecture need?

•  How  can  the  data  contained  in  GCSS-MC  be  imported  into  HDFS? 

•  What  type  of  analytics  can  Hadoop  provide  for  GCSS-MC  data? 

•  How  will  the  data  get  back  to  the  GCSS-MC  database? 

1.3  Contributions 

The GCSS-MC database contains structured data stored in a relational database. This
data is accessed, parsed, and stored in the HDFS ecosystem, where it is then
used to run analytics for the GCSS-MC system. The key idea is that HDFS is used to
examine all of the data in the corpus, perform some calculations, and return the data
in a different format to the GCSS-MC database.
This becomes particularly useful when only a small amount of data needs to
be derived and displayed from the entire data corpus. This is illustrated in two separate
analytic programs. The programs are written to show that any analytic that can be run in a
typical database operation can be accomplished in HDFS. Furthermore, HDFS can be used
to perform the exact same queries that a Structured Query Language (SQL) statement can
perform. This is increasingly interesting when the data exceeds the capability of standard
database designs.

1.4  Thesis  Outline 

The  rest  of  the  chapters  in  this  thesis  will  further  expand  on  the  aforementioned  ideas  and 
research  questions.  Below  is  a  brief  outline  of  what  the  reader  can  expect  to  find  in  each 
chapter. 

Chapter 2 serves as a starting point for the technology covered in this thesis. The chapter
reviews the current architecture and status of the GCSS-MC system. The chapter also
discusses the start of the MapReduce paradigm through the introduction of Jeffrey Dean
and Sanjay Ghemawat’s paper on MapReduce. Furthermore, an explanation and
walk-through of the canonical WordCount example is given. The chapter wraps up with an
introduction of the HDFS architecture.

The  experiment  architecture  is  introduced  in  Chapter  3.  The  chapter  begins  with  some 
statistics  to  further  illustrate  the  need  for  big  data  solutions.  The  chapter  then  outlines  how 
a  big  data  element  could  be  added  to  the  current  GCSS-MC  architecture.  Finally,  it  explains 
how  the  Hadoop  cluster  is  created,  and  some  implementation  decisions  that  were  required. 

Chapter 4 details the specifics of the experiment. First, there is a discussion of the
GCSS-MC data and how it was accessed. Then the chapter delineates how the data is
parsed and stored in the Hadoop ecosystem. Two separate analytics are also discussed
in Chapter 4. The first analytic is the Top 100 National Stock Number (NSN)
program, which is shown in detail. The second analytic is a program called Alpha, an
example of SQL simulation. Alpha is also discussed and shown in detail.

The final chapter summarizes the findings and reiterates the research questions that are
presented in the thesis. Chapter 5 also discusses some of the possible future work, such
as adding big data tools (Hive, Sqoop, HBase, etc.), running the experiment code on a
production-level HDFS ecosystem, and further optimizing the code.
CHAPTER 2:
The  Current  State 


The Department of Defense (DOD) creates, monitors, and views petabytes (PB) of data
every day. The current data management software and hardware make it very difficult
to manage, organize, and parse large datasets. Moreover, the data resides across several
databases. This infrastructure makes it difficult, or even impossible, for a user who needs
to aggregate data from several databases to make intelligent deductions from that data. The
DOD has significant work remaining to streamline all of its data into one space. There
are several projects working on a solution to place all of the data in a cloud environment
that allows all users to access the data payload. These programs vary greatly across the
DOD and no front-runner sets the standard. Even before the DOD can start using the cloud
as a large data store, several steps need to take place to get the DOD ready to move to a
cloud environment.

The DOD took on the challenge of bringing all of the legacy business systems into a
cohesive system in the 1990s. DOD’s business systems are information systems, including
financial and non-financial systems, that support DOD business operations, such as
civilian personnel, finance, health, logistics, military personnel, procurement, and
transportation [2]. This process has become known as ERP and entails an automated system
using commercial off-the-shelf (COTS) software consisting of multiple, integrated
functional modules that perform a variety of business-related tasks such as general ledger
accounting, payroll, and supply chain management [2]. ERP efforts across the DOD have
been widely publicized for schedule delays and cost overruns (because of their size,
complexity, and significance to the DOD). ERP has been placed on the Government
Accountability Office (GAO) high-risk list [2].

In total, there are nine ERP solutions being developed for the DOD. Together, these systems
are intended to replace over 500 legacy systems. Consolidating the legacy systems and
aggregating their data will save the DOD money and give end users a more powerful system
that they can use to manage their work and to run analytics that may help them
improve their work methods. The ERP solution is an ongoing process in which significant
effort has recently been invested toward a successful project. The USMC solution for ERP is the
GCSS-MC. The GCSS-MC will replace legacy systems and tailor a solution that is useful
to both the Marine in the field and the Marine shipping the supplies.


2.1  GCSS-MC:  The  USMC  ERP  Solution 

The GCSS-MC system was designed to run on COTS architecture and combine all of
the USMC legacy business systems into one system. This process has been ongoing since
November 2003, and it was intended to deliver integrated functionality across the logistics
areas [3]. The process has advanced incrementally through the delivery of COTS
architecture to bring all of the USMC logistics tools to one place. Figure 2.1 is an example of
the system architecture, from [4]. The program has had its setbacks and successes. The
overwhelming process of replacing up to forty-eight legacy systems is not easily
accomplished [3]. The system is attempting to provide a replacement for all legacy systems in
one place, which makes the GCSS-MC system incredibly complex. The system has to be
designed to meet the needs of the USMC both in garrison and in the field, all while
maintaining real-time data updates over the entire system. This is a difficult
undertaking and should not be taken lightly. The point of this thesis is not to take
on the challenge of what should or should not have been done during either the design or the
implementation of GCSS-MC. Moreover, this thesis does not aim to replace or change
the current architecture of the GCSS-MC system. The main aim of this thesis is to show a
proof of concept that big data analytics can be added to the current GCSS-MC architecture.
GCSS-MC uses proven Oracle technology to give the users a reliable relational database.
The work on GCSS-MC is not done and its implementation process continues, but it does not
yet incorporate a big data element. Big data elements would give the USMC the ability to
store and analyze more data. Currently, the only data that GCSS-MC addresses is structured
data.
Figure 2.1: GCSS-MC Architecture


2.2  Big  Data 

Big data has become a buzzword in both industry and the government sector; evidence
of this is President Obama’s launch of a Big Data Research and Development
Initiative in March 2012 [5]. Big data has helped shift how data is thought about and what can
be done with data. Big data does not answer all questions and there are limitations to what
it can do. Big data can take the form of MapReduce, NoSQL, Bigtable, or something
completely different. The big data industry changes rapidly, and new tools and architectures
are constantly being introduced. These tools all provide some benefits and some drawbacks.
No one technology provides everything an end user would want. There is currently no
de facto standard, and arguments persist as to which technology is the best. The simple
answer is that they all provide something, and the right choice depends on the type of data, how
much data there is, and what the desired analytics are.
2.3 MapReduce

One of the major core technologies for big data is MapReduce. This process is used
heavily at Google and in Apache’s HDFS system [6], [7]. Google’s development in the
MapReduce space is largely what pushed the open source community to answer with HDFS [8].
Google’s efforts were documented in Jeffrey Dean and Sanjay Ghemawat’s paper on the
MapReduce paradigm [6]. This paper gave the data science community as a whole
a look at how Google operated its computer clusters to perform a large number of tasks.
The Google MapReduce paper was the first publication illustrating a clustered MapReduce
framework. Since the paper was published, several other models have been developed based on
Google’s model. Apache Hadoop is an example of a cluster computing environment created
after this paper was published. The paper is accepted as seminal and has been cited
in thousands of other publications.

MapReduce is a programming model that was developed by Google as an answer to how
to process and generate large amounts of data. At Google, MapReduce jobs are run every
day and they process more than twenty PB of data per day [6]. The programming model
was developed to abstract the complexity of cluster computing away from the programmer.
Parallel computing is extremely difficult, and if the programmer has to handle how to
parallelize the computation, how the system distributes the data around the cluster, and failures of
cluster nodes, as well as the data itself, the task can be overwhelming. Google developed a
system and a programming model that abstracts those concerns away from the user. The programmer can
then write code that is executed on the cluster through the Google File System [9].
The programmer then only has to write the code that handles the calculation; the
distribution of the data and how the work is parallelized are handled by the underlying system. This
allows the programmer to focus on the algorithms rather than on the parallel features.

Google found that most of their data computations involved mapping the data to a
computation and then storing the resultant key/value pairs. Google took further inspiration
from the map and reduce primitives that were available in Lisp [6]. This led
to the development of the MapReduce programming model. The programmer specifies
both the map and reduce functions to handle the data in key/value pairs. The map function
first takes the data and maps each logical record to intermediate key/value pairs. Then
the intermediate key/value pairs are sent to the reduce function, which applies the same function
to all of the values that share the same key. The canonical example of this is the word count
problem, which is discussed further in Section 2.4.
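
The two user-supplied functions can be sketched in plain Java without any cluster. The following is a minimal, single-process illustration of the model only; it is not Google's or Hadoop's implementation, and the names MapReduceModel, Pair, run, and wordCount are hypothetical helpers introduced here for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;

class MapReduceModel {

    /** Minimal key/value pair; a hypothetical helper, not a Hadoop or Google type. */
    static final class Pair<K, V> {
        final K key;
        final V value;
        Pair(K key, V value) { this.key = key; this.value = value; }
    }

    /**
     * map:    one input record  -> list of intermediate (key, value) pairs
     * reduce: (key, all values) -> one result for that key
     */
    static <I, K extends Comparable<K>, V, R> Map<K, R> run(
            List<I> records,
            Function<I, List<Pair<K, V>>> mapFn,
            BiFunction<K, List<V>, R> reduceFn) {
        // Apply map to every record and group the emitted values by key.
        Map<K, List<V>> grouped = new TreeMap<>();
        for (I record : records) {
            for (Pair<K, V> pair : mapFn.apply(record)) {
                grouped.computeIfAbsent(pair.key, k -> new ArrayList<>()).add(pair.value);
            }
        }
        // Apply reduce once per distinct key, passing all values for that key.
        Map<K, R> results = new TreeMap<>();
        for (Map.Entry<K, List<V>> entry : grouped.entrySet()) {
            results.put(entry.getKey(), reduceFn.apply(entry.getKey(), entry.getValue()));
        }
        return results;
    }

    /** Word count expressed with the two user-supplied functions. */
    static Map<String, Integer> wordCount(List<String> lines) {
        Function<String, List<Pair<String, Integer>>> mapFn = line -> {
            List<Pair<String, Integer>> out = new ArrayList<>();
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.add(new Pair<>(word, 1));
                }
            }
            return out;
        };
        BiFunction<String, List<Integer>, Integer> reduceFn = (word, ones) -> {
            int sum = 0;
            for (int one : ones) {
                sum += one;
            }
            return sum;
        };
        return run(lines, mapFn, reduceFn);
    }

    public static void main(String[] args) {
        // prints {a=2, b=2, c=1}
        System.out.println(wordCount(Arrays.asList("a b a", "b c")));
    }
}
```

Only mapFn and reduceFn change from one analytic to the next; the grouping step in run stands in for the work that the underlying system performs on the programmer's behalf.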

Google focused on implementing this cluster on a large number of commodity personal
computers (PCs) with a Gigabit Ethernet network [9]. This allowed them to quickly and
cheaply build a large-scale cluster. This idea of using commodity PCs is important
because prior to this framework being developed, the conventional thinking was to have a
supercomputing environment, which needed very powerful, expensive servers. The fact
that Google showed that the computational power, storage, and speed could be matched
with a collection of PCs allowed the parallel computing community to expand into other areas
and make supercomputing more achievable for more people [10]. Google’s implementation
also has thousands of machines, and with that many machines failures are common. For this
reason, Google’s framework supports fault tolerance and allows a machine to
be replaced, and since the machines are PCs, the cost to replace them is significantly lower
than that of a server.

The Google framework executes the MapReduce program [6] by first splitting the input
files into pieces, typically 16-64 MB each. These are then sent
to worker nodes, and a map function is started on each worker node that has been
assigned a data split. One of the nodes is designated as the master. The master is
responsible for designating the worker nodes for both the map and reduce functions; the
master attempts to assign the tasks to idle nodes in the cluster. Each worker that is
assigned an input split performs the map function on the input data and parses
it into intermediate key/value pairs. The pairs are written to the local disk
through a buffer. The location of the intermediate values is then sent to the master so
that the master can tell the reduce workers where their input data resides.
A reduce worker node then makes a call to get the data from the location passed to it
by the master. The reduce node reads all of the data for its partition and sorts
the intermediate data by key. The reduce node then iterates over the data and, for
each unique key, sends the key and its values to the reduce function specified by the program. The
reduce node appends its output to the program-specified output file. When all of the map
and reduce tasks report completion to the master, the master wakes up the program and
the MapReduce call returns to that program. A pictorial example of this process
is shown in Figure 2.2, from [6].


[Figure 2.2 diagram: input files -> map phase -> intermediate files (on local disks) -> reduce phase -> output files]


Figure  2.2:  Google  Execution  Overview 
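
The execution steps described above (split, map, partition, sort, reduce) can be imitated in memory as a rough sketch. This is an illustration under stated assumptions, not Google's implementation: every "worker" is just a loop iteration, there is no master or disk, and the class and method names are hypothetical. The partition function hash(key) mod R is the default described in [6]; the bit mask that keeps the hash non-negative is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ExecutionSketch {

    /** Chooses the reduce task for a key: hash(key) mod R, kept non-negative. */
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /**
     * Simulates one word count job. Each element of splits stands in for an
     * input split handled by one map task; the result holds one sorted
     * output map per reduce task.
     */
    static List<Map<String, Integer>> run(List<String> splits, int numReducers) {
        // One bucket of intermediate data per reduce task. A TreeMap keeps the
        // keys sorted, mirroring the sort performed by each reduce worker.
        List<TreeMap<String, List<Integer>>> buckets = new ArrayList<>();
        for (int r = 0; r < numReducers; r++) {
            buckets.add(new TreeMap<>());
        }

        // Map phase: each split emits (word, 1) pairs, routed by the partition function.
        for (String split : splits) {
            for (String word : split.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                buckets.get(partition(word, numReducers))
                       .computeIfAbsent(word, k -> new ArrayList<>())
                       .add(1);
            }
        }

        // Reduce phase: each reduce task sums the values for each of its keys.
        List<Map<String, Integer>> outputs = new ArrayList<>();
        for (TreeMap<String, List<Integer>> bucket : buckets) {
            Map<String, Integer> out = new TreeMap<>();
            for (Map.Entry<String, List<Integer>> entry : bucket.entrySet()) {
                int sum = 0;
                for (int value : entry.getValue()) {
                    sum += value;
                }
                out.put(entry.getKey(), sum);
            }
            outputs.add(out);
        }
        return outputs;
    }

    public static void main(String[] args) {
        // Two input splits and two reduce tasks; which keys land in which
        // output depends on the hash of each key.
        System.out.println(run(Arrays.asList("a b a", "b c"), 2));
    }
}
```

Because every occurrence of a key is routed to the same bucket, each key appears in exactly one reduce output, which is why the reduce outputs can simply be written as separate files.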


2.4  Word  Count 

The MapReduce programming paradigm can be used to process all kinds of data and can
become very convoluted. The widely accepted "Hello World" of MapReduce is the word
count program. This example has been used by Google [6] and is also used as an example
by Hadoop [11]. Figure 2.3 is the source code for the WordCount program from the
Hadoop web page [11]. This is the example code that we will step through to explain the
MapReduce process.
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
                Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

Figure 2.3: Hadoop WordCount Program
The first section of code (Figure 2.4) gives the program access to the libraries
needed to execute the code in the Hadoop environment. The first two imports are Java
packages that are not specific to the Hadoop libraries. The first import is the
IOException class from the java.io package. This allows the program to throw the I/O
exceptions it encounters, which provides a graceful shutdown and some help in debugging.
The next import is the Java utilities package, which gives the program access to the
Iterator interface and the StringTokenizer class. The next five imports are all Hadoop
specific. The org.apache.hadoop.fs.Path class allows the program to name locations in the
Hadoop file system. This is needed both for the input path of the files needed to run the
program and the output path of the result. The next
package, org.apache.hadoop.conf.*, is needed to set the job up. This is the part of the code
that is executed in the main method. The package org.apache.hadoop.io.* provides the
Hadoop data types (such as Text and IntWritable) needed to read data into the program and
write data out of the program, and is additionally needed to write
to logs, stdout, and stderr. The Hadoop utility package, org.apache.hadoop.util.*, is used to
report progress on the program during execution. Finally, the org.apache.hadoop.mapred.*
package is where the program gets access to the majority of the classes needed. This
package includes the interface definitions for the map and reduce methods. It also gives the
program the class definitions for the input format, the input split, how the job is configured,
the output collector, and the output format.


import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

Figure 2.4: Hadoop WordCount Program: Libraries


The map class is the next area of focus (Figure 2.5). The program first has to
make the map declaration.

public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable>

In this case, Map is the program-defined class name and it extends the
MapReduceBase class. The MapReduceBase is what allows the Map class to be called
in the job. Next, the Map class must implement the Mapper interface. This is the
interface that actually maps the input to the intermediate output. The program must specify
the input key/value data types and the output key/value data types that will become the
intermediate data. Here the input key is the byte offset of the line from the input document
and the value is the entire line of the input. The Map class then must declare a map method.

public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
        Reporter reporter)


The method again specifies the input key/value pair. The method declares the
OutputCollector interface, sets the key/value output types, and gives the OutputCollector
a name. The Reporter type is what reports status back to the Hadoop system. This
map method sets the output to be a Text and an IntWritable, both of which are Hadoop types.
Finally, the output statement writes the intermediate data out.

output.collect(word, one);


public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
            Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

Figure 2.5: Hadoop WordCount Program: Map Class
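
To see what the map method emits for a single input record, the tokenizing loop can be run on its own in plain Java. This is a sketch only: the class MapStepSketch and the tab-separated string records are stand-ins chosen for illustration, whereas the real method writes Text/IntWritable pairs to an OutputCollector.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class MapStepSketch {

    /**
     * Mirrors the body of the map method: one "word<TAB>1" record per token.
     * The byte offset key is accepted but unused, as in WordCount.
     */
    static List<String> map(long byteOffset, String line) {
        List<String> emitted = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            emitted.add(tokenizer.nextToken() + "\t1");
        }
        return emitted;
    }

    public static void main(String[] args) {
        // Six tokens produce six records; repeated words are emitted each time.
        for (String record : map(0L, "to be or not to be")) {
            System.out.println(record);
        }
    }
}
```

Note that the map step does no counting of its own: "to" and "be" are each emitted twice with a value of 1, and the totals are formed later by the reduce step.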


Next the reduce class takes over (Figure 2.6). The reduce class has to be specified in
the program and must accept the input data types defined by the map output data types. The
reduce class declaration is the first thing that must be considered.

public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable>


As with the map class, in order to run as a job the reduce class must also extend the
MapReduceBase class. Then, the reduce class implements the Reducer interface and must declare
the key/value input and output data types. This example shows the case where the reduce
class key/value input and output types are the same; this is not a constraint. The reduce class
could output whatever the programmer chooses as long as it is a Hadoop data type. The
reduce class must define a reduce method.

public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)


The reduce method declaration is similar to the map declaration. The OutputCollector and
Reporter declarations are the same. The difference here is the declaration of the input
key/value pairs. The method must define the key type and then an iterator for the input
value type. This is because before the intermediate data gets to the reduce class, it is
shuffled, and the reduce class receives all of the intermediate data that has the same
key. So the reduce method sets the key and then iterates over all the values in
the intermediate data that share that key. This is where the output key/value
pairs are reduced in the manner specified by the program. In this example, the final value
for each key is just the sum of all of the values. Finally, the reduce class has to set the
output. This call is similar to the map output statement.

output.collect(key, new IntWritable(sum));


public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

Figure 2.6: Hadoop WordCount Program: Reduce Class
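
The summing loop can likewise be tried in isolation. The sketch below assumes the shuffle has already gathered every value for one key into an iterator; ReduceStepSketch is a hypothetical name, and plain String and Integer stand in for the Hadoop Text and IntWritable types.

```java
import java.util.Arrays;
import java.util.Iterator;

class ReduceStepSketch {

    /** Mirrors the body of the reduce method: sums every value seen for one key. */
    static int reduce(String key, Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    public static void main(String[] args) {
        // After the shuffle, the key "be" arrives with both of its 1s together.
        Iterator<Integer> valuesForBe = Arrays.asList(1, 1).iterator();
        System.out.println("be\t" + reduce("be", valuesForBe)); // prints be, a tab, then 2
    }
}
```

Because the method only adds up whatever values it is handed, it also gives the correct total when the values are partial sums rather than 1s, which is what allows WordCount to reuse the Reduce class as its combiner.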
The final piece the program must implement is a main method (Figure 2.7). The main
method is responsible for the setup of the MapReduce job. This is done by declaring a
JobConf object and telling it where the map and reduce methods are found. In this example,
the classes are found in the WordCount class.

JobConf conf = new JobConf(WordCount.class);

The JobConf typically sets the mapper, reducer, input format, and output format [12], but it
also allows the programmer to manipulate the job. Here the programmer can
set specific combiner classes, use the distributed cache, use custom comparators, and
even manipulate how the program executes by changing memory requirements for the map
and reduce tasks. The JobConf also allows the setting of the input and output paths. In this
example they are set from the command line arguments.

JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");

conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);

conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);

conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);

FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));

JobClient.runJob(conf);


The last step is a walkthrough of the WordCount program execution. The idea is to use a small data set and follow the data flow and data transformation through the program. This example will look at the nursery rhyme Jack and Jill (Figure 2.8).
Jack and Jill went up the hill
To fetch a pail of water.
Jack fell down and broke his crown,
And Jill came tumbling after

Figure  2.8:  Jack  and  Jill  Nursery  Rhyme 


For the purpose of this walkthrough, we will consider only one map task and one reduce task. First, the Map class reads in the first line of the file. The first line comes in with a byte offset of 0, so the key is zero and the value is the line "Jack and Jill went up the hill". Figure 2.9 shows the input key/value pairs for the whole text.

0     Jack and Jill went up the hill
31    To fetch a pail of water.
57    Jack fell down and broke his crown,
93    And Jill came tumbling after

Figure  2.9:  Byte  Offset  for  Jack  and  Jill  Nursery  Rhyme 
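The offsets in Figure 2.9 can be checked by hand: each key is the number of bytes that precede the line, counting one newline byte per line. The following is a minimal sketch in plain Java with no Hadoop dependency. The class and method names are illustrative, and it assumes ASCII text with a single-byte "\n" line ending.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only (not Hadoop code): reproduces the byte-offset keys
// of Figure 2.9, assuming each line ends with a single '\n' byte.
class ByteOffsetSketch {

    // Returns the starting byte offset of every line, in input order.
    static List<Long> offsets(List<String> lines) {
        List<Long> result = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            result.add(offset);
            // Advance past the bytes of this line plus one newline byte.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> rhyme = Arrays.asList(
                "Jack and Jill went up the hill",
                "To fetch a pail of water.",
                "Jack fell down and broke his crown,",
                "And Jill came tumbling after");
        List<Long> keys = offsets(rhyme);
        for (int i = 0; i < rhyme.size(); i++) {
            System.out.println(keys.get(i) + "\t" + rhyme.get(i));
        }
    }
}
```

A file with Windows-style line endings would shift every offset after the first line, which is why the newline assumption is stated explicitly.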


The line is then parsed using a StringTokenizer, which treats each word as a token. A while loop then runs as long as there are more tokens. On each pass, the loop sets a Text variable to the String value of the current token and emits the word together with the value one. This sends an intermediate key/value pair of ("word", 1) for each word in the line. The intermediate values for the first line are as follows:


Jack 1
and 1
Jill 1
went 1
up 1
the 1
hill 1
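The map step described above can be sketched in plain Java without the Hadoop classes. This is only an illustration: the class name MapStepSketch is hypothetical, and the returned list stands in for the OutputCollector. Note that a default StringTokenizer splits on whitespace only, so a token such as "water." keeps its trailing period unless punctuation is stripped separately; the figures in this section show the words without punctuation.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Illustrative sketch only: a plain-Java stand-in for the WordCount map step.
class MapStepSketch {

    // Emits one (word, 1) pair per whitespace-separated token, in line order.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            out.add(new SimpleEntry<String, Integer>(tokens.nextToken(), 1));
        }
        return out;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> pair : map("Jack and Jill went up the hill")) {
            System.out.println(pair.getKey() + " " + pair.getValue());
        }
    }
}
```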


This process is repeated for each line in the input file, producing an intermediate output of ("word", 1) for each word that occurs in the file. This intermediate data will then be sorted and shuffled prior to going to the reduce method. Figure 2.10 illustrates the intermediate values for the entire file.


Jack 1
and 1
Jill 1
went 1
up 1
the 1
hill 1
To 1
fetch 1
a 1
pail 1
of 1
water 1
Jack 1
fell 1
down 1
and 1
broke 1
his 1
crown 1
And 1
Jill 1
came 1
tumbling 1
after 1


Figure  2.10:  Intermediate  Key/Value  Pairs  from  the  Map  Method 


The sorting and shuffling process then takes over. The data is sorted by key; the ordering shown in Figure 2.11 places capitalized words ahead of lowercase ones. The shuffling process groups the data into logical units and sends each unit to a reduce task. Again, there is only one reduce task, so it will get the entire data set. The input to the reduce method is illustrated in Figure 2.11.
And 1
Jack 1
Jack 1
Jill 1
Jill 1
To 1
a 1
after 1
and 1
and 1
broke 1
came 1
crown 1
down 1
fell 1
fetch 1
hill 1
his 1
of 1
pail 1
the 1
tumbling 1
up 1
water 1
went 1


Figure 2.11: Shuffled Intermediate Key/Value Pairs from the Map Method
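The ordering in Figure 2.11 can be imitated with a sorted map. The sketch below is plain Java, not the Hadoop shuffle itself: it uses a TreeMap with natural String ordering, which for ASCII words matches the order in the figure (uppercase letters sort before lowercase ones). The class name and the grouping into lists are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: groups intermediate (word, 1) pairs by key and
// keeps the keys sorted, imitating what a single reduce task would receive.
class ShuffleSketch {

    // Collects every value under its key; the TreeMap keeps keys in sorted order.
    static TreeMap<String, List<Integer>> sortAndGroup(List<String> words) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : words) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    public static void main(String[] args) {
        String text = "Jack and Jill went up the hill To fetch a pail of water "
                + "Jack fell down and broke his crown And Jill came tumbling after";
        for (Map.Entry<String, List<Integer>> e : sortAndGroup(Arrays.asList(text.split(" "))).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Each map entry corresponds to one call of the reduce method: the key plus the list of values that the reducer iterates over.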


Notice that this program treats "and" and "And" as different words. This is how the map method is programmed; if this result is undesirable, the programmer could use string manipulation or a regular expression to normalize the words. The reduce method then takes over. It takes in each key and iterates over all values for that key. In this reduce method, an int variable is used to sum the number of occurrences of each word. For instance, the word "Jack" is seen twice; the reducer will set the value of sum to one on the first occurrence of "Jack" by the call:

sum += values.next().get();


At this point, the sum is set to one. On the next occurrence of "Jack", the same statement is executed, and sum becomes two. This process continues until there are no more values for the key "Jack". In this case, there are no more values, so the reduce method then makes the call to write the output of the key and the sum by:

output.collect(key, new IntWritable(sum));


The reduce method will get called for each unique key in the intermediate data and write one output statement for each key until it has seen all of the keys in the intermediate data. The final result in this example is seen in Figure 2.12.
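The summation performed by the reduce method can be sketched in plain Java as follows. The class name is hypothetical and a plain Iterator of Integer stands in for the Hadoop Iterator of IntWritable. Because the job also registers Reduce as its combiner, the values reaching the reducer may already be partial sums rather than ones; the same loop still gives the right total.

```java
import java.util.Arrays;
import java.util.Iterator;

// Illustrative sketch only: the summing loop of the reduce method, with
// java.lang.Integer standing in for Hadoop's IntWritable.
class ReduceStepSketch {

    // Drains the iterator and returns the total for one key.
    static int reduce(String key, Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    public static void main(String[] args) {
        // Two occurrences of "Jack", each emitted by the map step as a 1.
        System.out.println("Jack " + reduce("Jack", Arrays.asList(1, 1).iterator()));
    }
}
```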

And 1
Jack 2
Jill 2
To 1
a 1
after 1
and 2
broke 1
came 1
crown 1
down 1
fell 1
fetch 1
hill 1
his 1
of 1
pail 1
the 1
tumbling 1
up 1
water 1
went 1


Figure  2.12:  Final  Output  of  WordCount 
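If "and" and "And" should be counted together, one possible normalization is to lower-case each token and strip non-alphanumeric characters before counting. The sketch below is an assumption about how this could be done, not the program used in this thesis; the class name and the choice of regular expression are illustrative.

```java
import java.util.Locale;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Illustrative sketch only: word count with case folding and punctuation removal.
class NormalizedWordCountSketch {

    // Lower-cases a token and removes everything except ASCII letters and digits.
    static String normalize(String token) {
        return token.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    // Counts normalized words; tokens that normalize to nothing are skipped.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer tokens = new StringTokenizer(text);
        while (tokens.hasMoreTokens()) {
            String word = normalize(tokens.nextToken());
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String rhyme = "Jack and Jill went up the hill\n"
                + "To fetch a pail of water.\n"
                + "Jack fell down and broke his crown,\n"
                + "And Jill came tumbling after";
        count(rhyme).forEach((word, n) -> System.out.println(word + " " + n));
    }
}
```

In a MapReduce job the normalize call would sit in the map method, so that the shuffle already groups both spellings under one key.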


2.5  Hadoop 

Hadoop is an open source project that was modeled after the Google architecture outlined in Jeffrey Dean and Sanjay Ghemawat's MapReduce paper [6]. Hadoop started out as part of an open source web search engine called Nutch [7]. The Nutch project was having trouble handling the distributed computations needed for a web search engine. Then, building on the Google papers on MapReduce and the Google File System (GFS), the project began to catch hold [7], [6], [9]. At that time, Yahoo! became interested in the project, and Hadoop split from the Nutch project and became what it is today. HDFS is the distributed file system that supports the MapReduce framework, and Figure 2.13 illustrates the HDFS architecture, from [11]. The distributed file system is set up to be fault tolerant and to spread data over several nodes in a cluster. The data replication factor is nominally set to three. However, this can be changed to meet the specific needs of a given cluster. The most basic cluster, a single node, has the following services running: NameNode, JobTracker, TaskTracker, Secondary NameNode, and DataNode.


The NameNode is known as the master in the cluster. It is responsible for the file system namespace [13] and manages the whole cluster. It maintains the file system tree and stores the metadata for all of the directories and files within the distributed file system. The NameNode is the most important node in the cluster because it is needed to maintain all of the files in the distributed file system. If the NameNode were lost, then all of the files in the distributed file system would be lost, because only the NameNode holds the file structure needed to put the blocks back together. The NameNode stores the metadata and file tree in two files: the namespace image and an edit log [7]. Since HDFS stores all files in blocks, the NameNode keeps a reference to every block that exists in the file system.

The Secondary NameNode is there to back up the NameNode. However, it is not a true backup in the sense that the Secondary NameNode would take over if the NameNode fails. The main purpose of the Secondary NameNode is to periodically merge the namespace image with the edit log, which prevents the edit log from becoming too large [7]. The Secondary NameNode keeps a copy of the merged edit log and file system image. It is important to note that this copy is not real time; if HDFS is restored from the Secondary NameNode image, any changes made between the last edit log merge and the NameNode failure will be lost. HDFS does provide a way for the NameNode to write its persistent state to several file systems to reduce the chance of all the file systems failing at once.
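The relationship between the namespace image, the edit log, and the periodic merge can be illustrated with a toy model. The sketch below is not HDFS code: the real image and edit log are on-disk files with many more operation types, and every name here (CheckpointSketch, create, delete, checkpoint) is hypothetical. It only shows why edits logged after the last merge are missing from a restored checkpoint.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: a set of paths stands in for the namespace image and a list
// of {operation, path} records stands in for the edit log.
class CheckpointSketch {
    private final Set<String> image = new TreeSet<>();
    private final List<String[]> editLog = new ArrayList<>();

    void create(String path) { editLog.add(new String[] {"CREATE", path}); }
    void delete(String path) { editLog.add(new String[] {"DELETE", path}); }

    // Applies a list of edits, in order, to a set of paths.
    private static void replay(Set<String> paths, List<String[]> edits) {
        for (String[] edit : edits) {
            if (edit[0].equals("CREATE")) {
                paths.add(edit[1]);
            } else {
                paths.remove(edit[1]);
            }
        }
    }

    // Periodic merge: fold the edit log into the image, then empty the log.
    // The returned copy stands in for the image kept by the Secondary NameNode.
    Set<String> checkpoint() {
        replay(image, editLog);
        editLog.clear();
        return new TreeSet<>(image);
    }

    // The live namespace is the image plus every edit logged since the last merge.
    Set<String> liveNamespace() {
        Set<String> live = new TreeSet<>(image);
        replay(live, editLog);
        return live;
    }

    int pendingEdits() { return editLog.size(); }

    public static void main(String[] args) {
        CheckpointSketch nameNode = new CheckpointSketch();
        nameNode.create("/data/a.json");
        nameNode.create("/data/b.json");
        Set<String> secondaryCopy = nameNode.checkpoint();
        nameNode.create("/data/c.json"); // logged after the last merge

        Set<String> lost = nameNode.liveNamespace();
        lost.removeAll(secondaryCopy);
        System.out.println("Restoring from the checkpoint would lose: " + lost);
    }
}
```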

The JobTracker is responsible for accepting jobs, dividing them into tasks, and assigning those tasks to DataNodes [14]. The JobTracker tries to maintain data locality, meaning that it tries to give each task to a DataNode that physically holds the blocks of data the task is to be executed on. The JobTracker queries the NameNode to determine which nodes physically have the data. This is done to cut down on the amount of data that has to be transferred over the network. The JobTracker then contacts the TaskTracker nodes to determine whether each node is available to run a task. The TaskTracker is the service running on the DataNodes that communicates with the JobTracker about the tasks it has been assigned.




Figure 2.14: HDFS JobTracker Interaction
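The data-locality preference described above can be sketched as a simple selection rule: prefer an idle node that already holds a replica of the block, and fall back to any idle node otherwise. This is a simplification under stated assumptions, not the JobTracker's actual scheduling logic (which also weighs factors such as rack placement and available task slots); the class, method, and node names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only: locality-first task placement.
class LocalitySketch {

    // Picks a node for a task whose input block is stored on replicaHolders.
    static Optional<String> assign(List<String> replicaHolders, Set<String> idleNodes) {
        // First choice: an idle node that already stores the block (no network copy).
        for (String node : replicaHolders) {
            if (idleNodes.contains(node)) {
                return Optional.of(node);
            }
        }
        // Fallback: any idle node; the block must then be sent over the network.
        if (idleNodes.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(new TreeSet<>(idleNodes).first());
    }

    public static void main(String[] args) {
        List<String> holders = Arrays.asList("node2", "node5", "node7");
        Set<String> idle = new HashSet<>(Arrays.asList("node1", "node5"));
        System.out.println("Assigned to: " + assign(holders, idle).orElse("none"));
    }
}
```

With a replication factor of three there are three candidate nodes for the first choice; with fewer replicas the fallback branch is taken more often.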


The DataNode is the true worker in HDFS. The DataNodes store and retrieve blocks of data. They inform the NameNode of the data blocks they hold, and they inform the JobTracker of the status of their currently running jobs and of their availability. The DataNodes are also the nodes responsible for actually running the MapReduce code on their blocks of data. The DataNodes communicate directly with other DataNodes when they need to send or share data. This direct access avoids the extra traffic that would result if the nodes were required to go through the NameNode to communicate with another DataNode.
CHAPTER 3:

The  Experiment  Design 


The  DOD  is  continuing  to  create  and  store  all  types  of  data  [15].  The  technology  for 
how  to  deal  with  data  is  continuously  changing.  Most  current  DOD  solutions  deal  with 
storing  data  in  a  relational  database  and  use  SQL  or  Procedural  Language/Structured  Query 
Language  (PL/SQL)  to  provide  the  analytics  for  this  data.  SQL  and  PL/SQL  provide  a  large 
range  of  analytics  and  are  very  useful  for  many  applications,  but  when  the  size  of  data  to 
analyze  becomes  large,  this  approach  hits  its  limitations.  Adding  a  Big  Data  element  to  a 
relational  database  will  provide  additional  means  to  store  and  analyze  data. 


3.1  Why  Add  a  Big  Data  Element? 

The  world  is  producing  petabytes  of  data  daily  and  the  amount  of  data  being  stored  is 
increasing.  A  few  examples  of  this  are  [13]: 

•  The  New  York  Stock  Exchange  generates  about  one  TB  of  new  trade  data  per  day. 

•  Facebook  hosts  approximately  10  billion  photos,  taking  up  one  PB  of  storage. 

•  Ancestry.com,  the  genealogy  site,  stores  around  2.5  PB  of  data. 

•  The  Internet  Archive  stores  around  2  PB  of  data,  and  is  growing  at  a  rate  of  20  TB 
per  month. 

•  The Large Hadron Collider near Geneva, Switzerland, will produce about 15 PB of data per year.

The  large  scale  of  this  data  makes  it  difficult  to  store  and  process  the  data  in  a  relational 
database.  A  big  data  element  would  allow  some  additional  flexibility  with  storing  and 
analyzing  data. 

Specifically, the addition of an HDFS cluster would allow quick processing of the entire data set. The true power of Hadoop is that it processes the entire data set [7]. Hadoop can look at all of the data in the database and return whatever analysis the programmer desires, which puts the power of all of the data in the corpus at the analyst's fingertips. There is little to no concern that one must perform a large number of table scans to return analytics; Hadoop by its very nature performs a full scan every time a program is run. That is its true power, and it places no limitations on the possible analytics that can be run. Hadoop uses the MapReduce paradigm and forces the programmer to deal with key/value pairs, but MapReduce jobs can be chained together to find analytics for all sorts of interesting problems.


3.2  Adding  a  Big  Data  Element  to  GCSS-MC 

The current architecture of GCSS-MC does not include a component to support big data processing. The architecture that we are proposing in this thesis shows how such a component can be added as an additional element and integrated with the current architecture as long as it has Internet Protocol (IP) connectivity. The work we did in this thesis shows how a separate HDFS cluster can interact with a separate database. We did not have access to the GCSS-MC database, so we used an Oracle database to simulate it. Figure 3.1 shows the setup that we used for our experimentation.


[Diagram: 10-node HDFS cluster, virtualized on 2 machines]

Figure 3.1: Experiment Architecture
This experiment architecture does abstract away some of the difficulty that would be experienced in fully integrating an HDFS cluster into the GCSS-MC system. However, we believe that it fully supports the proof of concept that the integration can be done, and done effectively, without disrupting the current system. During the setup and testing for this thesis, the only stipulation that we found for the cluster to work effectively was that it had to have IP connectivity to the Oracle database.

The Hadoop side of the architecture uses only open source technologies, which are readily available to everyone. We chose Ubuntu 12.04 LTS as the operating system for the Hadoop cluster and Oracle 11g as the relational database to read from and write to, because that is what the GCSS-MC system currently uses for its database. The choices of operating system and database can be changed and adapted to many other situations with a few modifications. That is all of the software that was needed to run the experiment. There are several other big data abstractions and tools that could be used, but we felt that keeping the experiment to core HDFS was important to show this proof of concept and to limit the amount of software needed to make the experiment functional.


3.3  Building  a  Hadoop  Cluster 

We considered several possibilities for the best way to deploy the Hadoop cluster. The initial reaction was to simply run several virtual machines on a large server to represent a cluster. But this thinking goes against the idea of using commodity machines linked together to gain additive power [6]. So we decided to run a simple experiment comparing a single Hadoop machine with a virtualized two-node Hadoop cluster. The experiment ran several benchmarks on the two setups and found that the speedup was hampered by the virtualized environment. Our conclusion was that no matter how many virtual machines were running, ultimately they all had to read and write data on the same hard disk, which causes a bottleneck. We found some other interesting factors that contribute to slower performance in a virtualized Hadoop environment as well; the full paper can be found in the Appendix. There has been a lot of research on optimizing Hadoop in virtual computing environments, which has found tunable settings in Hadoop and the hypervisor that can give the same performance as hardware [16].

Despite our findings, we decided to run a ten-node cluster on two servers. This was decided largely because of the available resources and our aim to show a proof of concept (and not to focus on performance). Once we decided on our way forward, we began to build the cluster that would be used for the experiment. The cluster was built on top of two servers running Ubuntu 12.04 LTS. The virtual environment that we chose was Oracle's VirtualBox. The nodes were split between the two servers, with seven nodes on server one and three on server two. The split was based on the available random access memory (RAM) on each server.


Table 3.1: Experiment Server Specifications

Specifications      Server #1                Server #2
OS                  Ubuntu 12.04 (64 Bit)    Ubuntu 12.04 (64 Bit)
CPU(s)              4                        4
CPU MHz             1500                     1600
HD (TB)             2                        1
RAM (GB)            32                       16
# of HDFS Nodes     7                        3

Once the servers and virtual environment were set up, we built the 10 virtual machines (VMs) on the servers. The next step was to create and start the Hadoop nodes on the VMs. The Hadoop build was done using Tom White's Hadoop: The Definitive Guide [7] and Michael Noll's Hadoop Tutorials [17]. Each node was built and tested separately prior to adding it to the cluster. This process could have been expedited by cloning the machines; however, we wanted to ensure the integrity of each machine and the cluster.

Table 3.2: Experiment HDFS Node Specifications

Specifications      HDFS Nodes
OS                  Ubuntu 12.04 (64 Bit)
CPU(s)              1
HD (GB)             100
RAM (GB)            4

The process of installing Hadoop on a VM is not all that different from installing any software package. The Hadoop install is downloaded as a compressed file (.tar.gz). There are a few prerequisites to the Hadoop install. First, you must ensure that you have an up-to-date version of Java (our testing found that Hadoop works with Java 6 or Java 7). You must also configure SSH on the VM in order to allow Hadoop to communicate. Once this is done, you have to edit several configuration files to set up the environment, and then you are off and running. The configuration files are also used to control and optimize the cluster. See the appendix for full installation instructions.
CHAPTER 4:
The  Experiment 


This chapter focuses on two things: first, how the data is taken from the database and loaded into the Hadoop ecosystem, and then how the data is used in two separate analytic programs. The first program is a home-grown analytic tool that uses the GCSS-MC data to find the 100 most frequent NSNs in the whole data set and then builds tables based on the top 100 NSNs found. The second program is a simulated SQL program, taken from a similar project that is asking for SQL analytics on the same data set. The program runs on the data and produces the same results that SQL would produce. The second program is included to show that Hadoop can reproduce analytics done in an SQL environment. We believe that this makes a strong case that Hadoop can be used to provide analytics on both structured and unstructured data sets.

4.1  The  GCSS-MC  Data 

The GCSS-MC data that we received is a sample of actual production data. This data set was obtained from the USMC for the thesis experiment. The data was first loaded into an Oracle 11g database. From the database we are able to see fourteen tables of GCSS-MC data. The first step in working with this structured data was to write a program to pull the data out of the relational database and move it into the Hadoop ecosystem. Several decisions had to be made about how to pull the data out and in what format. The final choice was to write a Java program that accesses the Oracle database using the Java database connectivity libraries and parses each table into a separate file. The program was written to parse each table in the database into JavaScript Object Notation (JSON) format. A sample of the data in JSON format can be found in Figure 4.1. The NSNs appear in the PRIME_NSN and RECORD_NSN fields of the JSON formatted data.

JSON format was chosen because it lends itself nicely to moving data between different databases. Also, a table can quickly be built from the format. This is important because the goal of this project is to pull all of the data from a relational database, run big data analytics, and ultimately return the results to a simplified relational database table.
{
"XXMC_MERIT_RETAIL_INVENTORY": [
{"ACTIVITY_ADDRESS_CODE":"MMC100","BACK_ORDER_QUANTITY":"0","CONTROL_ITEM_CODE":"null",
"DAY30_USAGE_RD":"null","DUE_PROVISIONS":"0","DUE_STOCK":"0","EXCH":"0",
"FIXED_LEVEL":"0","FLOAT_REORDER":"25","FREEZE_CODE":"Y","FREEZE_REASON":"GHOST SN",
"FREEZE_DATE":"2013-05-01 23:24:47.0","GABF_DATE":"2013-12-08 11:31:33.0",
"LAST_TRANSACTION_DATE":"2013-10-15 13:24:01.0","MATERIAL_ID_CODE":"B","MOFFSET":"0",
"NON_SYSTEM_ID_CODE":"null","NO_1ST_RECEIPT":"null","OH_PROVISIONS":"0",
"OH_STOCK_SERVICEABLES":"40","OH_UNSERVICEABLES":"0","PHRASE_CODE":"4",
"PRIME_NSN":"4330010463399","NOMENCLATURE":"FILTER ELEMENT, FLUI","RECORD_NSN":"4330010463399",
"REORDER_DATE":"2013-12-05 06:07:28.0","REORDER_POINT":"25","REQUISITION_OBJECTIVE":"33",
"ROUTING_ADDRESS_CODE":"MMC300","ROUTING_IDENTIFIER_CODE":"MCI","SEC_CODE":"U",
"SPL_ALLOW":"0","STORE_ACCOUNT_CODE":"1","SUPPLY_SOURCE_CODE":"null","TOTAL_MO_ALLOW":"0",
"UNIT_OF_ISSUE":"EA","UNIT_PRICE":"7.34","PROCESS_STATUS":"Y","RECORD_ID":"73906115",
"CREATED_BY":"1277","CREATION_DATE":"2013-12-08 11:31:33.0","LAST_UPDATED_BY":"1277",
"LAST_UPDATE_DATE":"2013-12-08 11:31:33.0","REQUEST_ID":"27221076","BATCH_ID":"GENEP54431990",
"EXTERNAL_APPLICATION":"merit","REQUIREMENT_CODE":"IFUC","OPERATION_CODE":"61",
"IIP_QUANTITY":"0"},


Figure  4.1:  Sample  GCSS-MC  data  in  JSON  Format 


There are, however, some compromises that are made when using JSON format. When JSON is used, there is going to be some increase in the size it takes to store the data. In our case the Oracle database is approximately 1.25 gigabytes (GB), and when that data is taken out of the database and parsed into JSON it becomes 5.3 GB. That is a data increase factor of 4.24. This experiment can handle that level of storage increase, but that will not be the case as the data becomes larger than 1 TB. We discussed this at length and chose to move forward with JSON with the understanding that this would have to change in order to increase the scale of the data. The main reason for us to keep JSON is that keeping the metadata in the data provides more flexibility in transforming and exporting the data dynamically.

Another option would have been to use a different format for the data that would reduce the size requirements. For example, the data could have been pulled from the database and parsed into Comma Separated Value (CSV) format. The CSV format would reduce the data blow-up factor significantly. In the data set for this thesis, the increase factor for CSV would be 1.176, which is roughly 3.6 times smaller than the JSON factor of 4.24. Additionally, the data could be reduced by omitting fields where the database has a null value. For this data set we did not strip the null values out of the data. Again, the main reason to keep the null values was to provide the most flexibility to the analytics in the future.

A great deal of time has to be spent on how to deal with the data when the possibility exists for the data to exceed TBs. The larger the data set, the more effort must be spent in minimizing the size of the data, especially because the Hadoop ecosystem will increase the storage requirement of any data set by a factor of three. This is due to the data replication factor set for Hadoop. The data replication factor is tunable, but if it is reduced below three, the cluster will be more susceptible to faults, and programs may run slower due to the larger network overhead of having to send the data to a node to run the computation. The JobTracker will try to schedule the computation on an idle node where the data resides. Therefore, if the replication factor is decreased, the chance of finding an idle node that has the data decreases. The JobTracker then has to schedule the computation on another node and the data has to be sent to that node, resulting in the overhead of additional scheduling messages as well as the actual sending of the data over the network. Hadoop can handle large amounts of data, and the main limit on the size of the data Hadoop can handle is the physical limitations of the machines on which it is deployed. As an illustration of what a Hadoop cluster can reach, in 2010 Facebook had a Hadoop cluster of 2,000 nodes with a storage capacity of twenty-one PB [18].
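The storage arithmetic in this section can be made explicit with a short calculation. The sketch below uses the sizes reported above (1.25 GB in Oracle, 5.3 GB as JSON) and a replication factor of three. It ignores compression and any per-block overhead, and the 1,000 GB projection at the end is a hypothetical input chosen for illustration, not a measured value.

```java
// Illustrative sketch only: storage estimates for exported data on HDFS.
class StorageEstimateSketch {

    // Ratio of exported size to source database size.
    static double blowUpFactor(double sourceGb, double exportedGb) {
        return exportedGb / sourceGb;
    }

    // Raw cluster storage consumed once every block is replicated.
    static double replicatedGb(double exportedGb, int replicationFactor) {
        return exportedGb * replicationFactor;
    }

    public static void main(String[] args) {
        double factor = blowUpFactor(1.25, 5.3);
        System.out.printf("JSON blow-up factor: %.2f%n", factor);
        System.out.printf("Raw HDFS storage at replication 3: %.1f GB%n", replicatedGb(5.3, 3));
        // Hypothetical projection: a 1,000 GB database with the same blow-up factor.
        System.out.printf("Projected raw storage for a 1,000 GB database: %.0f GB%n",
                replicatedGb(1000 * factor, 3));
    }
}
```

The projection makes the scaling concern concrete: the export format and the replication factor multiply, so both have to be considered together when sizing a cluster.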

Once all of the decisions were made on how to parse the GCSS-MC database, the next step was importing the data into the Hadoop ecosystem. We chose to simply use the Hadoop file manipulation commands to ingest the JSON feed into the Hadoop ecosystem. The ingest is ultimately accomplished with a bash script executed on the NameNode. The bash script executes the Java program to pull the data from the Oracle database and then executes the Hadoop file system command to import the data. We experimented with Sqoop to do this and were pleased with the results. Sqoop is an Apache Hadoop tool that supports importing and exporting data between databases and Hadoop. However, because Sqoop did not support the JSON format we needed, we wrote a Java program that executes Java Database Connectivity (JDBC) calls to the database and then generates the JSON format. The program is designed to make the JDBC call to the database and write the JSON for each row on a separate line in a text file. Having each row of data in the database represented by one line in the text file is important because of how Hadoop reads the file in a MapReduce algorithm. For instance, if the program spread a row of data over more than one line, it would be nearly impossible to write a MapReduce program to recreate the database row. This is because Hadoop MapReduce handles each line of input data separately, which is extremely important to ensure the program can be split up and executed in parallel. Of course, we could write another program to handle that and recreate the row, but that would incur a large overhead cost. Because we wrote the program that pulls the data, we have the power to control how the data is handled. We do not mean to say that using additional tools like Sqoop or another third-party tool is a bad thing, just that in building our program we wanted to ensure we controlled the data at each stage, and writing our own program to handle that was the best solution in our experiment. Additionally, we decided to keep the software footprint needed to run this experiment as small as possible. We thought that if we could create all of the programs we needed, we would better understand the data flow and thus better understand how to use the tools that abstract the lower-level coding.
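The one-row-per-line requirement can be illustrated with a small serializer. The sketch below is not the export program used in this thesis: it takes a row as an ordered map of column names to values (in a real export these would come from a JDBC ResultSet and its ResultSetMetaData), writes every value as a string, and renders database nulls as the string "null" as in Figure 4.1. The class and method names are hypothetical. The key point is that control characters such as newlines are escaped, so a row can never spill onto a second line.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: turns one database row into a single-line JSON object.
class JsonLineSketch {

    // Escapes quotes, backslashes, and control characters so the value stays on one line.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters become six-character hex escapes.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Serializes the row in column order; a null column value becomes the string "null".
    static String toJsonLine(Map<String, String> row) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> column : row.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            first = false;
            String value = column.getValue() == null ? "null" : column.getValue();
            sb.append('"').append(escape(column.getKey())).append("\":\"")
              .append(escape(value)).append('"');
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("PRIME_NSN", "4330010463399");
        row.put("CONTROL_ITEM_CODE", null);
        row.put("UNIT_PRICE", "7.34");
        System.out.println(toJsonLine(row));
    }
}
```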


4.2 Top-100-NSN Program

The first program we designed and developed is the NSN program. This program scans all 13 files pulled from the database and finds the top 100 most frequently occurring NSNs. Then it pulls the data associated with those NSNs and creates a SQL statement to write the data back to the database. The idea of this program was to scan a large data set and then create smaller tables that contain only specifically defined, needed data. During the creation of this program we made the choices as to what data to return. Although we made educated guesses about what a Supply Officer would deem important, that is not the true measure of the importance of what we were able to show. The fact that the data is returned is what is important. The choice of what subset of data is returned does not matter; that can be changed in the program, and then whatever data a customer wants can be delivered.

In order to get the desired result, the program had to be written to execute several MapReduce jobs chained together. In this program we had to write five separate MapReduce algorithms in order to achieve our desired result. The first MapReduce algorithm simply counts the occurrences of NSNs. The second algorithm sorts the NSNs in descending order of frequency. The next algorithm finds all of the data associated with the NSNs. Then the final two algorithms create the JSON and the SQL statements and write the data to the Oracle database. Each one of the algorithms was individually created and tested. Once all of the algorithms were completed, we packaged them up into a single MapReduce job to run.

The flow of the entire program is illustrated in Figure 4.2. The figure gives a graphic
representation of the data flow through the whole program by illustrating the data at each
algorithm. The inputs are on the left side of the algorithm boxes and the outputs are on
the right side. The inputs at the top of a box are used by the algorithm in the configure
method of the map phase.

4.2.1 Top-100-NSN Program: First Algorithm

The  first  MapReduce  algorithm  in  the  NSN  program  finds  and  counts  all  of  the  NSNs  in 
the  entire  corpus.  This  algorithm  is  similar  to  a  word  count  algorithm. 


[Figure 4.3 panels: Input: All Database Tables (JSON records). Output: Key/Value: NSN, Frequency; Ascending Order.]

Figure 4.3: NSN: First Algorithm


The major difference is that the "word" that is counted is limited to an NSN. The map phase
scans all thirteen tables and outputs only valid NSNs. The reduce phase then sums
the occurrences of each NSN and produces the output. This step is the first in the
analysis process and is needed to find and output the most frequent NSNs in the data corpus.
The data input to the algorithm is similar to the data in Figure 4.1, and the highlighted values
are the NSNs we are searching for in this algorithm. A sample of the output is shown in Figure
4.4.


1240015251648  645
6140014851472  630
1005013832872  579
5855014320524  543
1005012310973  465

Figure  4.4:  NSN  Output  from  the  First  Algorithm 
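
The counting step can be sketched in plain Java, without the Hadoop API, to make the map and reduce roles concrete. This is an illustration under stated assumptions rather than the program used in the experiment: the class name NsnCountSketch, the rule that a valid NSN is a quoted 13-digit value, and the choice to emit only the first match per line are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Plain-Java imitation of the NSN count; no Hadoop classes are used. */
final class NsnCountSketch {
    // Assumption: a valid NSN appears as a quoted 13-digit value in the JSON line.
    private static final Pattern NSN = Pattern.compile("\"(\\d{13})\"");

    /** "Map" step: emit the first NSN found in a line, or nothing at all. */
    static List<String> map(String jsonLine) {
        List<String> emitted = new ArrayList<>();
        Matcher m = NSN.matcher(jsonLine);
        if (m.find()) {
            emitted.add(m.group(1));
        }
        return emitted;
    }

    /** "Reduce" step: sum the occurrences per NSN; the TreeMap keeps keys ascending. */
    static Map<String, Integer> countNsns(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String nsn : map(line)) {
                counts.merge(nsn, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "{\"PRIME_NSN\":\"4330010463399\",\"UNIT_OF_ISSUE\":\"EA\"}",
            "{\"RECORD_ID\":\"73906115\",\"UNIT_PRICE\":\"7.34\"}",
            "{\"PRIME_NSN\":\"4330010463399\",\"REORDER_POINT\":\"25\"}",
            "{\"PRIME_NSN\":\"1005013832872\"}");
        for (Map.Entry<String, Integer> e : countNsns(lines).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

The sorted map stands in for the key ordering that Hadoop applies to reduce output; lines without a 13-digit value produce no output, which mirrors the "valid NSN only" behavior described above.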


4.2.2 Top-100-NSN Program: Second Algorithm

The next step is to sort the NSNs by frequency. Hadoop always imposes an order on
the output of the reduce phase, based on the key of each key/value pair. By default,
Hadoop orders the output of the reduce phase by key in ascending alphabetic and
numeric order.

In this algorithm we want the reduce phase to produce output ordered by the frequency
of NSN occurrences in descending order. This has to be done in a separate algorithm
from the first: if we attempted to order the output in the first algorithm before it
completes, the sums of the NSN occurrences would not be accurate. The output of the
first algorithm therefore becomes the input to the second algorithm. There are two major
obstacles to overcome to produce the desired output. First, we have to override the output
key comparator class to produce key output in descending order. This comparison occurs in
the combining and shuffling steps of Hadoop. The code created to perform this
descending-order sort is shown in Figure 4.6. The method overrides the compare method that
Hadoop calls; it delegates to the standard Text comparator and reverses the order by
multiplying the result by negative one.

[Figure 4.5 panels: Input: Key/Value: NSN, Frequency; Ascending Order. Algorithm 2. Output: Key/Value: NSN, Frequency; Descending Order.]

Figure 4.5: NSN: Second Algorithm

The second obstacle is that we need to transpose the key/value pairs so that the frequency
of an NSN becomes the key and the NSN becomes the value. This is done to produce an
output that is ordered in descending order based on the frequency of occurrence of a
particular NSN. The transposing of the key/value pair is accomplished in the map phase.
The shuffling and combining phase then produces the input to the reduce phase with the
NSN frequency as the key and the NSN itself as the value. A sample of the output of the
algorithm can be seen in Figure 4.7.

static class ReverseComparator extends WritableComparator {
    // Uses org.apache.hadoop.io.Text and org.apache.hadoop.io.WritableComparator.
    private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();

    public ReverseComparator() {
        super(Text.class);
    }

    // Negate the default Text byte comparison so that keys sort in descending order.
    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return (-1) * TEXT_COMPARATOR.compare(b1, s1, l1, b2, s2, l2);
    }
}

Figure 4.6: Descending Order Sort Class


645  1240015251648 
630  6140014851472 
579  1005013832872 
543  5855014320524 
465  1005012310973 

Figure  4.7:  NSN  Output  from  the  Second  Algorithm 
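
The two steps of the second algorithm, swapping key and value and then sorting with a negated comparator, can be sketched in plain Java. This is not the Hadoop code of Figure 4.6: the class name DescendingSortSketch, the tab-separated line format, and the use of integer keys are assumptions made for illustration. One observation, offered as a caveat rather than a finding from the experiment: a byte-wise Text comparison orders keys character by character, so counts with different numbers of digits would need zero padding or a numeric key type to sort by magnitude. The sketch compares parsed integers to sidestep that.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Plain-Java imitation of the swap-and-sort step; no Hadoop classes are used. */
final class DescendingSortSketch {
    /** Reverses a base comparator by negating its result, in the spirit of Figure 4.6. */
    static <T> Comparator<T> reversed(final Comparator<T> base) {
        return (a, b) -> -1 * base.compare(a, b);
    }

    /** "Map" step: turn "NSN<TAB>count" into the pair {count, NSN}. */
    static String[] swap(String line) {
        String[] parts = line.split("\t");
        return new String[] {parts[1], parts[0]};
    }

    /** Sorts the swapped pairs by numeric count, largest first. */
    static List<String> sortByFrequency(List<String> countLines) {
        List<String[]> pairs = new ArrayList<>();
        for (String line : countLines) {
            pairs.add(swap(line));
        }
        Comparator<String[]> ascending =
            Comparator.comparingInt(p -> Integer.parseInt(p[0]));
        pairs.sort(reversed(ascending));
        List<String> out = new ArrayList<>();
        for (String[] p : pairs) {
            out.add(p[0] + "\t" + p[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> firstOutput = Arrays.asList(
            "1005012310973\t465",
            "1005013832872\t579",
            "1240015251648\t645",
            "5855014320524\t543",
            "6140014851472\t630");
        for (String line : sortByFrequency(firstOutput)) {
            System.out.println(line);
        }
        // String comparison is character by character, so "9" ranks above "645".
        System.out.println("9".compareTo("645") > 0);
    }
}
```

Run on the five sample rows, the sketch lists the counts from 645 down to 465, each followed by its NSN, which is the ordering shown in Figure 4.7.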


4.2.3 Top-100-NSN Program: Third Algorithm

The output from the second algorithm enables us to search for and find all of the data
associated with the 100 most frequently occurring NSNs. To use the output from
the second algorithm, we relied on the distributed cache functionality that is part of Hadoop.
The distributed cache allows the programmer to access other files on the HDFS. We then
had to decide what type of data structure to build from the output of algorithm two.
After some initial testing we found that a hash map provided faster comparisons
than a map or a pattern. Hadoop also provides the configure method as a means to set
up data structures that will be needed in the map phase of the algorithm.
This is critical because each map instance has to build the data structure in exactly the
same way in order to achieve predictable, repeatable results. In this particular configure method
we create a hash map with the first 100 NSNs from the output of algorithm two.


[Figure 4.8 panels: Input: All Database Tables (JSON records). Configure: Key/Value: NSN, Frequency; Descending Order. Output: Key/Value: NSN, JSON where NSN occurs.]

Figure 4.8: NSN: Third Algorithm


The map phase then scans all of the data in the corpus and produces output
if and only if one of these NSNs is found in the line of data. The line is output in
JSON format with the NSN and the source file prepended to the JSON. The
reduce phase of this algorithm is the identity reducer: it simply outputs the key/value pairs
it receives from the map phase. A sample of the output can be found in Figure 4.9.


1005011182640  {"UNIT_NAME":"null","TAMCN":"null" ... }
1005011182640  {"UNIT_NAME":"null","TAMCN":"null" ... }
1005011182640  {"UNIT_NAME":"null","TAMCN":"null" ... }
1005011182640  {"UNIT_NAME":"null","TAMCN":"null" ... }
1005011182640  {"UNIT_NAME":"null","TAMCN":"null" ... }

Figure 4.9: NSN Output from the Third Algorithm
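
The configure-then-filter pattern of the third algorithm can be sketched in plain Java as follows. It is a simplified stand-in for the Hadoop code, not the code itself: the class name TopNsnFilterSketch, the tab-separated "count, NSN" input, the 13-digit NSN pattern, and the output layout are assumptions, and the distributed cache is replaced by an in-memory list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Plain-Java imitation of the configure and map steps; no Hadoop classes are used. */
final class TopNsnFilterSketch {
    // Assumption: NSNs appear as quoted 13-digit values in each JSON line.
    private static final Pattern NSN = Pattern.compile("\"(\\d{13})\"");

    /** "Configure" step: keep the first `limit` NSNs of the sorted "count<TAB>NSN" lines. */
    static Map<String, Integer> loadTopNsns(List<String> sortedLines, int limit) {
        Map<String, Integer> top = new HashMap<>();
        for (String line : sortedLines) {
            if (top.size() >= limit) {
                break;
            }
            String[] parts = line.split("\t");
            top.put(parts[1], Integer.parseInt(parts[0]));
        }
        return top;
    }

    /** "Map" step: emit "NSN<TAB>sourceFile<TAB>json" only if the line holds a top NSN. */
    static List<String> map(String sourceFile, String jsonLine, Map<String, Integer> top) {
        List<String> emitted = new ArrayList<>();
        Matcher m = NSN.matcher(jsonLine);
        while (m.find()) {
            if (top.containsKey(m.group(1))) {
                emitted.add(m.group(1) + "\t" + sourceFile + "\t" + jsonLine);
                break; // at most one output record per input line
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        Map<String, Integer> top = loadTopNsns(
            Arrays.asList("645\t1240015251648", "630\t6140014851472"), 100);
        String hit = "{\"NSN\":\"1240015251648\",\"TAMCN\":\"null\"}";
        String miss = "{\"NSN\":\"4330010463399\",\"TAMCN\":\"null\"}";
        System.out.println(map("XXMC_R001_INVENTORY_TBL.TXT", hit, top));
        System.out.println(map("XXMC_R001_INVENTORY_TBL.TXT", miss, top));
    }
}
```

Because the lookup table is built once per mapper and probed once per line, the per-record cost stays close to a single hash lookup, which is consistent with the preference for a hash map described above.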


4.2.4 Top-100-NSN Program: Fourth Algorithm

The next step is to take the output from the third algorithm and parse it down to the
JSON that we wish to write back to the database.

This algorithm and the next one could be combined into one that generates the JSON and
creates the SQL statement to write to the database using a JDBC call. Ultimately, we decided
to keep them separate so that we could use the JSON format in the future to regenerate
the tables, if necessary. Furthermore, to maintain the flexibility to adapt to other tools, such as
a database Hadoop connector, we felt it prudent to keep the JSON formatting code.
A sample of the output of the algorithm can be seen in Figure
4.11.

At this point we made some decisions about what data to write back. Table 4.1 lists
the data that we write back to the database. The data choice can be modified
and tailored to exactly what a customer or stakeholder desires. Our choices were made
just to show a proof of concept: that the data can be found and written back to a database
as a smaller set containing only the desired data. Any of the data that exists in the
original database can be retrieved with only minor modifications to the code base. The
benefit of this type of analytic is truly realized as the data size increases: the customer can
keep all of their data and produce tables of just the pertinent data required for a particular
need.
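
The field selection performed by the fourth algorithm can be sketched as a small projection routine. This is a sketch under assumptions, not the code used in the experiment: the class name JsonProjectionSketch, the SOURCE_TBL key, and the field list passed in are hypothetical, and the regular expression handles only flat records whose values contain no escaped quotes. The figures label the source file as "Table Name:"; the sketch uses a regular key instead so that the result is valid JSON.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Plain-Java sketch of reducing a full record to a chosen subset of fields. */
final class JsonProjectionSketch {
    /** Returns the value of "field":"value" in a flat JSON line, or "NULL" if absent. */
    static String fieldValue(String jsonLine, String field) {
        Matcher m = Pattern
            .compile("\"" + Pattern.quote(field) + "\"\\s*:\\s*\"([^\"]*)\"")
            .matcher(jsonLine);
        return m.find() ? m.group(1) : "NULL";
    }

    /** Builds the reduced record: NSN, source table, then the selected fields in order. */
    static String project(String nsn, String sourceTable, String jsonLine, List<String> fields) {
        StringBuilder out = new StringBuilder();
        out.append("{\"NSN\":\"").append(nsn)
           .append("\",\"SOURCE_TBL\":\"").append(sourceTable).append('"');
        for (String field : fields) {
            out.append(",\"").append(field).append("\":\"")
               .append(fieldValue(jsonLine, field)).append('"');
        }
        return out.append('}').toString();
    }

    public static void main(String[] args) {
        String record = "{\"ORDER_NUMBER\":\"1041015\",\"UNIT_OF_ISSUE\":\"EA\"}";
        System.out.println(project("1005011182640", "XXMC_R001_DUEIN_TBL.TXT",
            record, Arrays.asList("ORDER_NUMBER", "UNIT_NAME")));
    }
}
```

Changing what is written back then amounts to changing the field list per source table, which is the kind of minor modification the paragraph above refers to.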
[Figure 4.10 panels: Input: Key/Value: NSN, JSON where NSN occurs. Output: Key/Value: NSN, New JSON formatted to new output table.]

Figure 4.10: NSN: Fourth Algorithm


{"NSN":"1005011182640" Table Name: XXMC_R001_ALLOWANCES_TBL.TXT ,"UNIT_NAME":"NULL" ... }
{"NSN":"1005011182640" Table Name: XXMC_R001_ALLOWANCES_TBL.TXT ,"UNIT_NAME":"NULL" ... }
{"NSN":"1005011182640" Table Name: XXMC_R001_DUEIN_TBL.TXT ,"ORDER_NUMBER":"1041015" ... }
{"NSN":"1005011182640" Table Name: XXMC_R001_DUEIN_TBL.TXT ,"ORDER_NUMBER":"1596718" ... }
{"NSN":"1005011182640" Table Name: XXMC_R001_DUEIN_TBL.TXT ,"ORDER_NUMBER":"616556" ... }

Figure 4.11: NSN Output from the Fourth Algorithm

Table 4.1: Tables Written Back to Database

4.2.5 Top-100-NSN Program: Fifth Algorithm

The final algorithm for this program performs the MapReduce function that makes the
JDBC call to write the data back to the database. This algorithm takes the formatted JSON
output from the previous algorithm and creates an SQL statement to write the data to the
database.


[Figure 4.12 panels: Input: Key/Value: NSN, New JSON formatted to new output table. Output: Key/Value: NSN, SQL Insert Statement.]

Figure 4.12: NSN: Fifth Algorithm


The JDBC connection is made in the configure method of the map phase. The map phase
then creates the statement and makes the JDBC call to execute it. The output of
the map class is the NSN and the SQL statement. The reduce phase is the identity reducer and
simply outputs what it receives. The reduce phase could be eliminated and the output taken
directly from the map phase, which would speed up the algorithm execution. We decided
to keep the reduce phase to sort the map output. A sample output of the algorithm can be
seen in Figure 4.13.


6135009857845  INSERT INTO HDFS_R001_ALLOWANCES_NSN (NSN, SOURCE_TBL,
6135009857845  INSERT INTO HDFS_R001_ALLOWANCES_NSN (NSN, SOURCE_TBL,
6135009857845  INSERT INTO HDFS_R001_ALLOWANCES_NSN (NSN, SOURCE_TBL,
6135009857845  INSERT INTO HDFS_R001_ALLOWANCES_NSN (NSN, SOURCE_TBL,
6135009857845  INSERT INTO HDFS_R001_ALLOWANCES_NSN (NSN, SOURCE_TBL,


Figure  4.13:  NSN  Output  from  the  Fifth  Algorithm 
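
A hedged sketch of the statement-building step follows. It swaps in a parameterized PreparedStatement in place of a literal SQL string, which is a different technique from whatever string construction the original program used. The class name InsertStatementSketch, the table-naming rule (inferred from the table names visible in the figures), and the column list are assumptions. The write method needs an open JDBC connection and is therefore not exercised in main.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/** Sketch of turning a reduced record into an INSERT statement. */
final class InsertStatementSketch {
    /** Hypothetical naming rule: XXMC_R001_X_TBL.TXT becomes HDFS_R001_X_NSN. */
    static String targetTable(String sourceFile) {
        String base = sourceFile.replaceFirst("\\.TXT$", "").replaceFirst("_TBL$", "");
        return base.replaceFirst("^XXMC_", "HDFS_") + "_NSN";
    }

    /** Builds a parameterized INSERT; values are bound later rather than concatenated. */
    static String insertSql(String table, List<String> columns) {
        StringBuilder marks = new StringBuilder();
        for (int i = 0; i < columns.size(); i++) {
            marks.append(i == 0 ? "?" : ", ?");
        }
        return "INSERT INTO " + table + " (" + String.join(", ", columns)
            + ") VALUES (" + marks + ")";
    }

    /** Executes one insert over an open connection; the caller owns the connection. */
    static int write(Connection conn, String table, Map<String, String> row)
            throws SQLException {
        List<String> columns = new ArrayList<>(row.keySet());
        try (PreparedStatement ps = conn.prepareStatement(insertSql(table, columns))) {
            for (int i = 0; i < columns.size(); i++) {
                ps.setString(i + 1, row.get(columns.get(i)));
            }
            return ps.executeUpdate();
        }
    }

    public static void main(String[] args) {
        String table = targetTable("XXMC_R001_ALLOWANCES_TBL.TXT");
        System.out.println(insertSql(table, Arrays.asList("NSN", "SOURCE_TBL", "UNIT_NAME")));
    }
}
```

Opening the connection once per mapper, as the configure method does, and reusing it for every row avoids paying the connection cost for each insert.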


4.2.6 Top-100-NSN Program Conclusion

The result of the entire NSN program is the creation of eleven new tables in the database
with an aggregate of 916,190 rows. Table 4.2 shows the first ten rows of data from the
HDFS_R001_INVENTORY_NSN table created by the NSN program. The power of this type
of analytic is that it lets the customer analyze all of their data and filter it
down to only the data they are concerned with at the time. This creates the flexibility to
maintain and analyze the entire data set and also reduce it to manageable subsets
of needed data. This program does just that: it scans all of the data and then returns to the
database only the data for the most frequent NSNs, as a reduced subset of the original
data. Although this program writes a subset of the data back to the database, it could just as
easily reduce the data further into fewer tables by joining the data from any or all of the
tables. The possibilities for the analytics Hadoop can perform are limitless.


Table 4.2: Sample of HDFS_R001_INVENTORY_NSN Table

NSN            Source Table Name        RECORD_ID  STATUS_CODE  SERIAL_NUM  TAMCN
1005013832872  XXMC_R001_INVENTORY_TBL  441912925  LATEST       10328808    E14422M
1005013832872  XXMC_R001_INVENTORY_TBL  441912926  LATEST       10328831    E14422M
1005013832872  XXMC_R001_INVENTORY_TBL  441912927  LATEST       10328995    E14422M
1005013832872  XXMC_R001_INVENTORY_TBL  441912928  CREATED      10329091    E14422M
1005013832872  XXMC_R001_INVENTORY_TBL  441912929  LATEST       10329141    E14422M
1005013832872  XXMC_R001_INVENTORY_TBL  441912930  LATEST       10329204    E14422M
1005013832872  XXMC_R001_INVENTORY_TBL  441912931  LATEST       10329312    E14422M
1005013832872  XXMC_R001_INVENTORY_TBL  441912932  LATEST       10329315    E14422M
1005013832872  XXMC_R001_INVENTORY_TBL  441912933  LATEST       10329316    E14422M
1005013832872  XXMC_R001_INVENTORY_TBL  441912934  LATEST       10329351    E14422M

4.3 SQL Simulation

The SQL simulation program was an analytic taken from another ongoing Naval Postgraduate
project that is examining the same GCSS-MC data. We felt it prudent to create an
analytic for something the USMC is currently looking for. The other project focuses
on using SQL to generate analytics and then display the results in a web tier architecture.
Although we do not take the results to a display in the web tier, we do produce the same
results and write them back to the database. The overarching idea is that the HDFS
resources are used to process the data, and then something like a web tier business analytic
suite can display the table in a graph. The total program, which we will call Alpha, consists of
five MapReduce algorithms. The purpose of the analytic is to produce a readiness report
aggregated over time, equipment, and/or unit. Figure 4.14 illustrates the data relationships
and the logic used in the Alpha program.

Figure  4.14:  Alpha  Program  Logic 


The flow of the entire program is illustrated in Figure 4.15. The figure gives a graphic
representation of the data flow through the whole program by illustrating the data at each
algorithm. The inputs are on the left side of the algorithm boxes and the outputs are on
the right side. The inputs at the top of a box are used by the algorithm in the configure
method of the map phase.

The SQL analytic consists of data calculations and three table joins. Accomplishing this in
Hadoop requires five MapReduce steps. The first step counts all NSNs with an
Operational Status value of "Deadlined" in the XXMC_R001_SRHEADERS_TBL. The next
step scans the XXMC_R001_ITEMMASTER_TBL and gets all of the NSNs that are "MARES"
reportable. The third step scans the XXMC_R001_INVENTORY_TBL and calculates the
number of "Onhand" items for each NSN. The fourth step joins the outputs from steps two
and three and calculates the percentage of "Deadlined" versus "Onhand". The final step
writes the data back to the database table.

4.3.1  SQL  Simulation:  First  Algorithm 

The first MapReduce algorithm performed in the Alpha program scans the
XXMC_R001_SRHEADERS_TBL. The map phase parses the data and finds all of the
rows that have an Operational Status of "Deadlined". When the "Deadlined" value is seen
in a row, the map phase outputs a key/value pair of the NSN associated with that row
and the value one. The SRHEADERS table has an entry for each item that has been
"Deadlined". Therefore, there may be several "Deadlined" items associated with a
particular NSN, and we need the total number "Deadlined" for each NSN.

The reduce phase takes care of summing up all of the "Deadlined" values for each NSN.
The output from the first algorithm is the NSN and the total number of items "Deadlined"
for that particular NSN. A sample of the output of the algorithm can be seen in Figure
4.17.
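
This filtered count can be sketched in plain Java. It is only an illustration: the class name DeadlinedCountSketch and the field names NSN and OPERATIONAL_STATUS are hypothetical, since the actual column names of the SRHEADERS table are not shown here, and the records are assumed to be flat JSON lines.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Plain-Java imitation of counting "Deadlined" rows per NSN. */
final class DeadlinedCountSketch {
    /** Returns the value of "name":"value" in a flat JSON line, or null if absent. */
    static String field(String jsonLine, String name) {
        Matcher m = Pattern
            .compile("\"" + Pattern.quote(name) + "\"\\s*:\\s*\"([^\"]*)\"")
            .matcher(jsonLine);
        return m.find() ? m.group(1) : null;
    }

    /** Map: emit (NSN, 1) for deadlined rows. Reduce: sum the ones per NSN. */
    static Map<String, Integer> countDeadlined(List<String> lines) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String line : lines) {
            String nsn = field(line, "NSN");
            // Hypothetical field name; rows without the status never match.
            boolean deadlined =
                "Deadlined".equalsIgnoreCase(field(line, "OPERATIONAL_STATUS"));
            if (deadlined && nsn != null) {
                totals.merge(nsn, 1, Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
            "{\"NSN\":\"1005007265636\",\"OPERATIONAL_STATUS\":\"Deadlined\"}",
            "{\"NSN\":\"1005007265636\",\"OPERATIONAL_STATUS\":\"Deadlined\"}",
            "{\"NSN\":\"1005009573893\",\"OPERATIONAL_STATUS\":\"Operational\"}");
        System.out.println(countDeadlined(rows));
    }
}
```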

4.3.2  SQL  Simulation:  Second  Algorithm 

The second MapReduce algorithm takes the XXMC_R001_ITEMMASTER_TBL as input.
The algorithm also accesses the output from the first algorithm through the distributed
cache (Figure 4.15); the configure method uses it to build a hash map with the NSN as the
key and the number "Deadlined" as the value. The map phase then scans the ITEMMASTER
table and checks each NSN to determine its "MARES" status. If the NSN is "MARES"
reportable, the algorithm checks whether the NSN is a key in the hash map. If it is, the
number "Deadlined" is obtained from the hash map and the map phase sets the output
key/value pair to NSN/Number "Deadlined"_"MARES" Category.

The reduce phase in this case is the identity reducer. A sample of the output of the algorithm
can be seen in Figure 4.19.

[Figure 4.16 panels: Input: SRHEADERS Table (JSON records). Output: Key/Value: NSN, Number "Deadlined".]

Figure 4.16: Alpha: First Algorithm

1005007265636  187
1005009573893  9
1005010258095  21
1005010351674  1
1005011055191  1

Figure 4.17: Alpha Output from the First Algorithm
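
The hash map lookup used by the second Alpha algorithm can be sketched as follows. The class name MaresLookupSketch, the boolean flag standing in for the "MARES" check, and the tab-separated input are assumptions. The sketch also assumes that an NSN missing from the hash map is reported with a count of zero; this is inferred from the 0_MARES row in Figure 4.19 and is not stated in the text.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Plain-Java imitation of the configure and map steps of the second Alpha algorithm. */
final class MaresLookupSketch {
    /** "Configure" step: parse "NSN<TAB>count" lines into a hash map. */
    static Map<String, Integer> loadDeadlined(List<String> firstOutput) {
        Map<String, Integer> deadlined = new HashMap<>();
        for (String line : firstOutput) {
            String[] parts = line.split("\t");
            deadlined.put(parts[0], Integer.parseInt(parts[1]));
        }
        return deadlined;
    }

    /** "Map" step: for a reportable NSN, emit "NSN<TAB>count_MARES"; otherwise nothing. */
    static Optional<String> map(String nsn, boolean maresReportable,
                                Map<String, Integer> deadlined) {
        if (!maresReportable) {
            return Optional.empty();
        }
        // Assumption: an NSN with no deadlined items is reported with a count of zero.
        int count = deadlined.getOrDefault(nsn, 0);
        return Optional.of(nsn + "\t" + count + "_MARES");
    }

    public static void main(String[] args) {
        Map<String, Integer> deadlined = loadDeadlined(
            Arrays.asList("1005007265636\t187", "1005009573893\t9"));
        System.out.println(map("1005007265636", true, deadlined).orElse("(no output)"));
        System.out.println(map("1005013592714", true, deadlined).orElse("(no output)"));
        System.out.println(map("1005009573893", false, deadlined).orElse("(no output)"));
    }
}
```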


4.3.3  SQL  Simulation:  Third  Algorithm 

The next MapReduce algorithm takes the XXMC_R001_INVENTORY_TBL as input. The
algorithm uses the output from the second algorithm to create a hash map in the configure
method. The hash map stores the NSN as the key, and the value is a string that
contains the "MARES" category and the number "Deadlined". The map phase scans the
INVENTORY table and checks whether each NSN is in the hash map; if it is, the map phase
gets the value for the quantity "Onhand". The map output is the key/value pair NSN/number "Onhand".

The reduce phase calculates the total number "Onhand" for each NSN. A sample output
of the algorithm can be seen in Figure 4.21.

4.3.4  SQL  Simulation:  Fourth  Algorithm 

The fourth MapReduce algorithm takes the output from the second algorithm as input and
the output from the third algorithm as distributed cache. The configure method is used to
build a hash map from the third algorithm's output. The map phase then performs a
map-side join on the output from the previous two algorithms: the input data is parsed, and
the values from the hash map are appended. The algorithm is also responsible for calculating
the percentage of "Deadlined" versus number "Onhand". The map output is the key/value pair
NSN/number "Deadlined"_"MARES" Category_number "Onhand"_Percentage.

The reduce phase is the identity reducer. A sample of the output of the algorithm can be
seen in Figure 4.23.

[Figure 4.18 panels: Input: ITEMMASTER Table (JSON records). Configure: Key/Value: NSN, Number "Deadlined". Algorithm 2. Output: Key/Value: NSN, Number "Deadlined"_"MARES" Category.]

Figure 4.18: Alpha: Second Algorithm
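
The map-side join and percentage calculation of the fourth algorithm can be sketched in plain Java. The class name ReadinessJoinSketch, the two-decimal percentage format, and the decision to emit nothing when no "Onhand" total exists are assumptions made for the sketch; the input and output layouts follow the key/value descriptions above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Plain-Java imitation of the map-side join in the fourth Alpha algorithm. */
final class ReadinessJoinSketch {
    /**
     * Joins one "NSN<TAB>deadlined_CATEGORY" line with the on-hand totals and
     * appends the on-hand count and the deadlined percentage.
     */
    static Optional<String> join(String secondOutputLine, Map<String, Integer> onhand) {
        String[] kv = secondOutputLine.split("\t");
        String nsn = kv[0];
        int split = kv[1].indexOf('_');
        int deadlined = Integer.parseInt(kv[1].substring(0, split));
        String category = kv[1].substring(split + 1);

        Integer have = onhand.get(nsn);
        if (have == null || have == 0) {
            return Optional.empty(); // assumption: skip NSNs without inventory
        }
        double percent = 100.0 * deadlined / have;
        return Optional.of(nsn + "\t" + deadlined + "_" + category + "_" + have + "_"
            + String.format(Locale.ROOT, "%.2f", percent));
    }

    public static void main(String[] args) {
        // On-hand totals as a hash map, standing in for the distributed cache file.
        Map<String, Integer> onhand = new HashMap<>();
        onhand.put("1005007265636", 3880);
        onhand.put("1005010351674", 4);
        System.out.println(join("1005007265636\t187_MARES", onhand).orElse("(no output)"));
        System.out.println(join("1005010351674\t1_MARES", onhand).orElse("(no output)"));
        System.out.println(join("1005013592714\t0_MARES", onhand).orElse("(no output)"));
    }
}
```

Because the smaller data set is held in memory and probed per input line, no shuffle is needed to bring matching keys together, which is what distinguishes this map-side join from a reduce-side join.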

4.3.5  SQL  Simulation:  Fifth  Algorithm 

The final MapReduce algorithm takes the output from the fourth algorithm as input.
The configure method is used to set up the JDBC connection to the database. The
map phase parses the input and creates and executes an SQL insert statement for each line
of input. The final output of the algorithm is the key/value pair of NSN/SQL statement.
A sample output of the algorithm can be seen in Figure 4.25.

1005007265636  187_MARES
1005009573893  9_MARES
1005010351674  1_MARES
1005013592714  0_MARES
1005014123129  172_MARES

Figure 4.19: Alpha Output from the Second Algorithm


1005007265636  187
1005009573893  9
1005010258095  21
1005010351674  1
1005011055191  1

Figure  4.25:  Alpha  Output  from  the  Fifth  Algorithm 


4.3.6  Alpha  Program  Conclusion 

The  result  of  the  entire  program  is  a  join  aeross  three  tables,  ealeulations,  and  a  database 
table  ereated  that  eontains  all  of  the  data.  A  sample  of  the  resultant  Alpha  table  is  in  Table 
4.3.  The  purpose  of  this  analytie  was  to  show  that  Hadoop  ean  perform  the  same  type  of 
analysis  of  struetured  data  that  SQL  ean.  This  will  beeome  inereasingly  important  as  the 
data  set  approaehes  values  over  1TB.  The  methodology  ean  be  adapted  to  perform  all  types 
of  SQL  statements  with  no  limitation.  One  thing  the  Alpha  program  does  not  show  is  the 
Input

INVENTORY Table (one JSON record per row; sample abridged to its clearly legible fields)

{ "FREEZE_CODE": "Y", "FREEZE_REASON": "GHOST SN", "OH_PROVISIONS": "0",
  "OH_STOCK_SERVICEABLES": "40", "OH_UNSERVICEABLES": "0",
  "PRIME_NSN": "4330010463399", "RECORD_NSN": "4330010463399",
  "REORDER_POINT": "25", "REQUISITION_OBJECTIVE": "33",
  "UNIT_OF_ISSUE": "EA", "UNIT_PRICE": "7.34", "PROCESS_STATUS": "Y",
  "RECORD_ID": "73906115", "CREATED_BY": "1277", ... }

Configure

Key/Value: NSN, Number "Deadlined"_"MARES" Category

1005007265636    187_MARES
1005009573893    9_MARES
1005010351674    1_MARES
1005013592714    0_MARES
1005014123129    172_MARES

Output

Key/Value: NSN, number "Onhand"

1005007265636    3880
1005009573893    497
1005010351674    4
1005013592714    59
1005014123129    9108

Figure 4.20: Alpha: Third Algorithm


ability to use multiple inputs to perform reduce-side joins. We purposely showed map-side
joins because we felt the data flow was easier to demonstrate with them. However, the
number of algorithms can be reduced by using reduce-side joins. A reduce-side join can
also achieve a small speedup, but it is not significant because all of the data still needs
to be scanned.
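
To make the reduce-side alternative concrete, the sketch below simulates it in plain Java without any Hadoop classes. Each input is tagged with its source during a simulated map step, the records are grouped by NSN as the shuffle would do, and a simulated reduce step joins the values that share a key. The class name, the "D:" and "O:" tags, and the inner-join behavior are assumptions made for illustration; the thesis itself implements only map-side joins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation of a reduce-side join; no Hadoop dependency.
class ReduceSideJoinSketch {

    // Simulated map step: tag every record with its source so the reducer
    // can tell the two inputs apart, then group records by key (the shuffle).
    static void mapTagged(Map<String, List<String>> shuffle, String tag,
                          Map<String, String> input) {
        for (Map.Entry<String, String> e : input.entrySet()) {
            shuffle.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                   .add(tag + ":" + e.getValue());
        }
    }

    // Simulated reduce step: all values for one NSN arrive together.
    // Returns null when either side is missing (inner join, an assumption).
    static String reduce(List<String> values) {
        String deadlined = null;
        String onhand = null;
        for (String v : values) {
            if (v.startsWith("D:")) {
                deadlined = v.substring(2);
            } else if (v.startsWith("O:")) {
                onhand = v.substring(2);
            }
        }
        return (deadlined == null || onhand == null) ? null : deadlined + "_" + onhand;
    }

    static Map<String, String> join(Map<String, String> deadlined,
                                    Map<String, String> onhand) {
        Map<String, List<String>> shuffle = new TreeMap<>();
        mapTagged(shuffle, "D", deadlined);
        mapTagged(shuffle, "O", onhand);
        Map<String, String> out = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : shuffle.entrySet()) {
            String joined = reduce(e.getValue());
            if (joined != null) {
                out.put(e.getKey(), joined);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> deadlined = new TreeMap<>();
        deadlined.put("1005007265636", "187_MARES");
        deadlined.put("1005009573893", "9_MARES");
        Map<String, String> onhand = new TreeMap<>();
        onhand.put("1005007265636", "3880");
        onhand.put("1005009573893", "497");
        System.out.println(join(deadlined, onhand));
    }
}
```

Because both inputs pass through a single job, the separate configure-time lookup used by the map-side version is not needed, which is how a reduce-side join can lower the number of algorithms.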
1005007265636    3880
1005009573893    497
1005010351674    4
1005013592714    59
1005014123129    9108

Figure 4.21: Alpha Output from the Third Algorithm


Table 4.3: Sample of HDFS_ALPHA Table

NSN            NUMBER_DEADLINED  MARES_CATEGORY  QUANTITY_ONHAND  PERCENTAGE
3895014538573  4                 MARES           24               16.67
3895015390585  1                 MARES           22               4.55
3895015508369  0                 MARES           25               0.00
3895015733847  1                 MARES           1                100.00
3930014783519  54                MARES           389              13.88
3930014862151  1                 MARES           6                16.67
3930015080886  58                MARES           541              10.72
3930015227364  6                 MARES           106              5.66
3930015330855  22                MARES           187              11.76
3930015735873  2                 MARES           5                40.00
Input

Key/Value: NSN, Number "Deadlined"_"MARES" Category

1005007265636    187_MARES
1005009573893    9_MARES
1005010351674    1_MARES
1005013592714    0_MARES
1005014123129    172_MARES

Configure

Key/Value: NSN, number "Onhand"

1005007265636    3880
1005009573893    497
1005010351674    4
1005013592714    59
1005014123129    9108

Output

Key/Value: NSN, Number "Deadlined"_"MARES" Category_number "Onhand"_Percentage

1005007265636    187_MARES_3880_4.82
1005009573893    9_MARES_497_1.81
1005010351674    1_MARES_4_25.00
1005013592714    0_MARES_59_0.00
1005014123129    172_MARES_9108_1.89

Figure 4.22: Alpha: Fourth Algorithm


1005007265636    187_MARES_3880_4.82
1005009573893    9_MARES_497_1.81
1005010351674    1_MARES_4_25.00
1005013592714    0_MARES_59_0.00
1005014123129    172_MARES_9108_1.89

Figure 4.23: Alpha Output from the Fourth Algorithm
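
The values in Figures 4.22 and 4.23 can be reproduced with a small map-side join sketch in plain Java. A configure step loads the NSN-to-on-hand lookup into memory, and the map step appends the on-hand quantity and the deadlined percentage to each input value. This is a sketch under stated assumptions rather than the thesis code: the class and method names are hypothetical, and returning null for an unmatched NSN and 0.00 for an on-hand quantity of zero are choices made only for this example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the fourth algorithm; no Hadoop dependency.
class FourthAlgorithmSketch {

    private final Map<String, Integer> onhandByNsn = new HashMap<>();

    // Stands in for the configure step: load the smaller side of the join
    // (NSN -> quantity on hand) into memory before any map call runs.
    void configure(Map<String, Integer> onhand) {
        onhandByNsn.putAll(onhand);
    }

    // Stands in for the map step. The input value looks like "187_MARES";
    // the output value looks like "187_MARES_3880_4.82".
    String map(String nsn, String value) {
        Integer onhand = onhandByNsn.get(nsn);
        if (onhand == null) {
            return null; // no match in the lookup table (assumed behavior)
        }
        int deadlined = Integer.parseInt(value.substring(0, value.indexOf('_')));
        // Guard against division by zero (assumed behavior).
        double percentage = onhand == 0 ? 0.0 : 100.0 * deadlined / onhand;
        return String.format(Locale.ROOT, "%s_%d_%.2f", value, onhand, percentage);
    }

    public static void main(String[] args) {
        FourthAlgorithmSketch sketch = new FourthAlgorithmSketch();
        Map<String, Integer> onhand = new HashMap<>();
        onhand.put("1005007265636", 3880);
        onhand.put("1005009573893", 497);
        sketch.configure(onhand);
        System.out.println(sketch.map("1005007265636", "187_MARES"));
        System.out.println(sketch.map("1005009573893", "9_MARES"));
    }
}
```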
Input

Key/Value: NSN, Number "Deadlined"_"MARES" Category_number "Onhand"_Percentage

1005007265636    187_MARES_3880_4.82
1005009573893    9_MARES_497_1.81
1005010351674    1_MARES_4_25.00
1005013592714    0_MARES_59_0.00
1005014123129    172_MARES_9108_1.89

Output

Key/Value: NSN, SQL Insert Statement

1005007265636    INSERT INTO HDFS_ALPHA(NSN, NUMBER_DEADLINED ...)
1005009573893    INSERT INTO HDFS_ALPHA(NSN, NUMBER_DEADLINED ...)
1005010351674    INSERT INTO HDFS_ALPHA(NSN, NUMBER_DEADLINED ...)
1005013592714    INSERT INTO HDFS_ALPHA(NSN, NUMBER_DEADLINED ...)
1005014123129    INSERT INTO HDFS_ALPHA(NSN, NUMBER_DEADLINED ...)

Figure 4.24: Alpha: Fifth Algorithm
CHAPTER 5:
Conclusion 


The amount of data that is collected and stored continues to increase every day. The
International Data Corporation (IDC) estimates that by the end of 2013 stored data will
total 2.7 zettabytes (ZB), a forty-eight percent increase from 2011 [19]. The USMC
is no different from the commercial world. The size of data stored in USMC databases is
growing, with the current size of the GCSS-MC database at six TB. There is a need to
find a solution to manage data storage increases in the USMC. This thesis demonstrated
that big data analytics could be added to the current GCSS-MC architecture, giving the
USMC the power to use all of its data in developing analytics. The remainder of this
chapter further explains how the research questions have been answered and covers
future work.

5.1  The  Outcome 

The power of processing GCSS-MC data in Hadoop is promising. The thesis shows examples
of the analytics that can be run in the Hadoop ecosystem on the GCSS-MC data. This
work shows promise that a Hadoop cluster can handle the analytics that are needed now
and is flexible enough to allow for the programming of additional analytics as needs arise.
With continued effort and exploration, the power of big data analytics could be at our
fingertips, providing function and simplicity to a complicated, large data set.

The research questions posed in Chapter 1 guided the overall proof of concept
behind adding big data analytics to GCSS-MC. We determined that a big data element
can be added to the GCSS-MC system and that it can be used to provide data analytics
within the GCSS-MC system. After completing the research and examining the results,
we found that some of the questions were broad or vague. We decided to pursue this
research using HDFS and as small a software footprint as possible. This thesis shows that
the concept of adding Hadoop to the GCSS-MC system is achievable. However, more work
is needed to show how Hadoop could be integrated into a system that more closely resembles
a production GCSS-MC system. The rest of this section reiterates the research questions
and indicates where the details of the research can be found within the thesis.
The first research question is: What would an architecture look like that adds a big data
element to GCSS-MC? This question is very broad and can be approached several ways.
In this thesis we decided to use HDFS as the big data element to add to the GCSS-MC
system. Furthermore, we built an HDFS cluster and showed how it could benefit the
GCSS-MC system. Chapters 3 and 4 explain these results in detail.

The second research question is: If an architecture can be developed, what modifications
would the GCSS-MC architecture need? As discussed in detail in Chapter 3, we used
a sample of the GCSS-MC database and demonstrated cluster interaction with that
database through IP connectivity. This abstraction of the real GCSS-MC system worked
very well in this thesis. We found that we were able to integrate the HDFS solution rather
seamlessly into our sample database and experiment architecture. In order to fully prove
that this approach will work, it needs to be further tested on an actual implementation of
the GCSS-MC system.

The third research question is: How can the data contained in GCSS-MC be imported into
HDFS? In this thesis we used JDBC to connect to the sample GCSS-MC database and parse
the data into JSON format. The data is then imported into HDFS with a bash script. Chapter
4 explains in more detail the process that was used to ultimately get the GCSS-MC data
into the HDFS ecosystem.

The fourth research question is: What type of analytics can Hadoop provide for GCSS-MC
data? This thesis explores two examples of how to provide analytics on GCSS-MC
data. Those examples are by no means the only analytics that HDFS can provide. They are
merely representative of what HDFS can provide. More specifically, the Alpha program
is an example of an analytic the USMC is asking to be completed on the GCSS-MC data.
Chapter 4 explains the code of both programs in great detail.

The fifth and final research question is: How will the data get back to the GCSS-MC
database? We chose to write back to the GCSS-MC database in the MapReduce code. We
achieved this by using a JDBC call within the map phase of the MapReduce code. Although
this is not the only way to write data to a database from Hadoop, we felt it was
the best way to achieve the data write-back functionality. Chapter 4 discusses this code.
5.2 Future Work

This thesis demonstrates the potential of using a Hadoop cluster as a big data element in
the GCSS-MC system. A few areas could be examined further to enhance that element.
This section focuses on three additional research areas that could extend this thesis.

The first area of research that can be extended from this thesis is moving the cluster to
a larger environment. We set up the cluster on two machines, and all of the nodes are
virtualized. In order to better represent what would be seen in a GCSS-MC production
environment, a cluster with more than twenty nodes should be used. Additionally, the
security of the cluster and the interconnectivity between the cluster and the GCSS-MC
system should be examined.

The next area of research could be a comparison between SQL and Hadoop analytics.
For instance, the Alpha program is derived from an analytic the USMC requested in the
GCSS-MC system. The Alpha program could be run on a cluster and the SQL version
executed on a database, and the runtime, compute power, and memory usage could then be
compared between the two. Furthermore, there are four more analytics
that can be written in MapReduce to allow for a more complete comparison.

The final extension of this research could be the addition of Hadoop big data tools.
Hadoop offers several tools that help with processing, analyzing, and importing/exporting
data. For instance, some effort could be placed on comparing import/export tools against
one another to discover which tools are best for the GCSS-MC data. Hadoop also offers a
tool called Hive, which allows data to be loaded into Hadoop and queried with SQL-like
statements. Some effort might be placed on examining performance metrics between
running a MapReduce analytic and running the same analytic in Hive.
APPENDIX A:

Hadoop Testing on Single and Two Node Clusters


The Hadoop Distributed File System (HDFS) is a cluster computing technology that uses
a parallel computing architecture to provide fast computing and redundant storage.
HDFS is part of Hadoop, an open source project from Apache whose processing model is
based on Jeffrey Dean and Sanjay Ghemawat's MapReduce paper [6]. The intent of HDFS is
to bring a reliable, fault-tolerant parallel architecture to the open source community.

The main idea behind the parallel computing architecture is to allow faster computing
than a single CPU can provide. Even with Moore's Law producing
computing power that doubles every twelve to eighteen months, there is still a need for
faster computing to deal with large amounts of data. Systems like HDFS aim to answer
the demand for processing large amounts of data quickly over a distributed environment.
HDFS allows processing of large amounts of data using a cluster with thousands of nodes
(i.e., networked computers).

This is not to say that HDFS is a panacea and the answer for all problems. For instance,
some problems do not lend themselves to the MapReduce paradigm; that is, they will not
see a speedup when computed in a cluster environment. The problems
that lend themselves well to the MapReduce paradigm are those that can be split into small
chunks and computed independently without changing the final outcome. The canonical
example of this is the WordCount program. This program takes multiple files as input and
computes the number of occurrences of each word across all of the input files. It can be
run by computing the word count of each individual file on a separate node and
then combining the results. This can finish significantly sooner than having
one computer process each file in turn. However, the additional step of combining the
intermediate files adds latency that would not exist if the work were done by
a single computer. That is considered to be overhead in the MapReduce algorithm. There
are other sources of overhead that need to be considered as well (e.g., network latency, data
locality, etc.).
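
The WordCount idea described above can be sketched in plain Java without a cluster. Each file is counted on its own, which is the part that separate nodes could do in parallel, and the per-file counts are then merged, which is the combining step that adds overhead. The class and method names are hypothetical, the word-splitting rule is an assumption, and the sketch runs sequentially in one process, so it illustrates only the decomposition and not the speedup.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process illustration of the WordCount decomposition.
class WordCountSketch {

    // Per-file step: count the words of one file independently of the others.
    // Splitting on non-word characters and lowercasing are assumptions.
    static Map<String, Integer> countFile(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Combining step: merge the intermediate per-file counts into one result.
    static Map<String, Integer> combine(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) -> total.merge(word, count, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // Two small strings stand in for two input files.
        Map<String, Integer> first = countFile("the quick fox");
        Map<String, Integer> second = countFile("The lazy dog, the end");
        System.out.println(combine(Arrays.asList(first, second)));
    }
}
```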
The purpose of this experiment is to analyze the speedup of an HDFS cluster. This will
be tested by running several benchmarks on two different clusters. The first cluster will
be a single node cluster and the second cluster will be a two node cluster. Every effort will
be made to control the environment so that any speedup or slowdown can be attributed
to the HDFS architecture and not to errors in the experiment.

A.1 Hypothesis

The experiment will examine several benchmarks to produce a speedup factor for each
benchmark. The hypothesis is that the two node cluster will perform better than the single
node cluster. We expect the two node cluster to perform around 1.75 times better than the
single node cluster, rather than the ideal 2 times, because of the overhead required by HDFS
in the two node cluster to distribute the work between the nodes.
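
The speedup factor in the hypothesis is the ratio of single node run time to two node run time. The short Java sketch below shows that arithmetic together with parallel efficiency (speedup divided by node count). The class name and the timings of 175 and 100 seconds are made-up values chosen only to reproduce the hypothesized factor of 1.75; they are not measurements from this experiment.

```java
// Hypothetical helper for the speedup arithmetic; timings are illustrative only.
class SpeedupSketch {

    // Speedup = time on the single node cluster / time on the two node cluster.
    static double speedup(double singleNodeSeconds, double twoNodeSeconds) {
        return singleNodeSeconds / twoNodeSeconds;
    }

    // Efficiency = speedup / number of nodes; 1.0 would mean no overhead at all.
    static double efficiency(double speedup, int nodes) {
        return speedup / nodes;
    }

    public static void main(String[] args) {
        double s = speedup(175.0, 100.0); // made-up timings, not measured values
        System.out.println("speedup = " + s);
        System.out.println("efficiency = " + efficiency(s, 2));
    }
}
```

A speedup below 1.0 would indicate that the two node cluster was slower than the single node cluster for that benchmark.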

A.2  Experiment  Architecture 

The experiment utilized a virtual environment to set up the two clusters. The settings for
each node were kept identical in order to minimize any variance in the experiment setup.
Each node was built in Oracle VirtualBox version 4.3.2 using an Ubuntu Linux release; see
Table A.1 for additional details. Each node was built and configured separately
in VirtualBox. HDFS1 was set up and configured to be the single node cluster. HDFS2 and
HDFS3 were set up and configured to be a two node cluster.


Table A.1: Initial Node Settings

Node Name   HDFS1                  HDFS2                  HDFS3
OS          Ubuntu 13.10 (64-bit)  Ubuntu 13.10 (64-bit)  Ubuntu 13.10 (64-bit)
RAM Size    4096 MB                4096 MB                4096 MB
HD Size     100.77 GB              100.77 GB              100.77 GB

The node configuration includes a replication factor, which sets how many times the data
is replicated across the nodes in a cluster. This factor was
overlooked when the initial hypothesis was considered. When a node has to write the same
data more than once, the write takes longer. However, if you reduce that
factor to one on a two node cluster, then the read takes longer, because the system
experiences overhead when a node has to pass the data to the other node to read. The
results can be found in Section A.4. The experiment architecture (see Figure A.1) was set
up and tested to ensure the basic WordCount program would run. After the initial testing
was run and confirmed to be accurate, the nodes were ready to begin testing.


Figure A.1: Experiment Architecture


A.3  Experiments  Run 

The tests that were run in this experiment were a write test, a read test, and a TeraSort test.
These benchmarks are all available as part of the HDFS release. The tests were configured
to maximize the effectiveness of the experimental architecture. The setup includes
single-node and two-node configurations. When configuring our two-node system, the
replication factor was accidentally set to two. This meant that whenever we wrote to HDFS,
the cluster had to write the data twice. We recognized this late in our experiment, but still
had time to perform additional tests with a replication factor of one. Each configuration
underwent twenty Read, Write, and TeraSort jobs.

The Write test generates ten files of 1GB each, thus putting 10GB on the cluster.
In between write jobs, the cluster was cleaned to prevent it from recognizing
redundancy and refusing to write for subsequent tests. The Read test reads ten files of
1GB each; we simply do not delete the files from the last run of the Write test. Finally, we
subject each configuration to a TeraSort test. This test first generates 10,000,000 rows
of one-hundred-byte data, which is 1GB of data to sort in total. After each run the output
files are removed.

Each test was run twenty times against each configuration. We collected the data and
summarize it in the following charts.


A.4  Results 

As mentioned previously, with a replication factor of two, write testing on the two-node
setup takes twenty percent longer than on a single node configuration (see Figure A.2),
despite using twice the system resources. This can be
attributed to the fact that our test VMs share the same underlying hard drives. If we
had a true set of servers, this result set would likely look much different. Another reason for
the slowdown is that HDFS has to write the data twice for redundancy. A standard
HDFS cluster uses triple redundancy, but that overhead is shared over dozens of servers.


[Bar chart: Write Benchmark. Bars for SingleNodeHadoop_RepFactor2,
SingleNodeHadoop_RepFactor1, 2NodeHadoop_RepFactor2, and 2NodeHadoop_RepFactor1;
horizontal axis: AVG Execution Time in Sec(s), 0 to 300.]

Figure A.2: Write Benchmark Results


When we get to the Read tests (Figure A.3), our results are closer to what we expected.
When reading in a distributed file system (DFS), if the node requesting data does not have
it stored locally, performance suffers. Thus our two-node setups lagged twenty percent and
sixty-six percent, respectively, behind the single node configuration. Interestingly,
the two-node double replication read ran quicker than the two-node single replication read.
This is a strength of a DFS: it front-loads the processing time by writing data to multiple
nodes. The result is much faster read times, since the data is in multiple places for nodes
to request; in our simulation, that meant both nodes had a copy of the data to read.
[Bar chart: Read Benchmark. Bars for SingleNodeHadoop_RepFactor2,
SingleNodeHadoop_RepFactor1, 2NodeHadoop_RepFactor2, and 2NodeHadoop_RepFactor1;
horizontal axis: AVG Execution Time in Sec(s), 0 to 350. Bar labels, in order of
appearance: 316.435, 98.35, 108.135, 147.305.]

Figure A.3: Read Benchmark Results


In our TeraSort tests (Figure A.4), we found that utilizing a second node provided a clear
improvement in average time. Both the single and dual replication two-node configurations
completed nearly sixty percent faster than any single node configuration.


[Bar chart: TeraSort Benchmark. Bars for SingleNodeHadoop_RepFactor2,
SingleNodeHadoop_RepFactor1, 2NodeHadoop_RepFactor2, and 2NodeHadoop_RepFactor1;
horizontal axis: AVG Execution Time in Sec(s), 0 to 200. Surviving bar label
fragments: 935, 455.]

Figure A.4: TeraSort Benchmark Results


A.5  Conclusion 

We concluded that even though we were able to see performance improvements with the
TeraSort and Read benchmarks, our underlying architecture prevented us from seeing the
same sort of improvement in the Write tests. We believe the main reason the write
tests do not show a performance increase is the replication factor. Exacerbating
the issue was the fact that the tests were performed in a virtual environment in which every
node has to access the same hard drive (HD) to write. In other words, the write operation
cannot be done in parallel.

In order to limit these side effects, we altered the replication factors to try to
obtain an equal comparison. This effort led to more data that confirmed some of
our thoughts. At first, we forced the two node cluster to a replication factor of one. With
the replication factor of one being the same between the single node and two node clusters,
we expected the write to perform the same or better on the two node system, given that HD
speed is the bottleneck. However, what we saw was that the two node cluster took longer to
write the same amount of data. After studying the HDFS architecture, we came to the
conclusion that the replication factor sets how many times the data is written, but not where
the data is written. So, on the two node cluster the master node (NameNode) is going to try
to balance the data across both nodes, and in doing so is going to experience network
overhead, which causes a delay. Additionally, because the nodes are both trying to write at
the same time to the same physical HD, they have to wait for access to the physical
drive. With the physical HD bottleneck and some network overhead, it is logical to think
that these account for the additional 15ms it takes to write the data on the two node cluster.

Moreover, we set both the two node cluster and the single node cluster to a replication
factor of two. This seems unrealistic on a single node system because the single node is
just going to write the same data twice on the same HD. Nevertheless, the tests were run in
an attempt to gain a fair level of comparison. What we found was that the single node
performed better than the two node cluster on the write test. Again, we can attribute this to
the fact that the two node system is splitting the work between the nodes, which causes some
network overhead, and that when the actual write happens both nodes write to the same
physical HD, which causes a slowdown as each process has to wait for access to the HD.
With the HD factor and the network overhead, the slowdown of 41ms seems logical.

After performing all of these tests and studying the HDFS architecture, we believe that a
fair comparison of speedup requires performing the tests on a physical cluster rather
than on virtual instances of clusters. The tests would also be more effective if we could run
them on, say, a three node cluster versus a six node cluster. These are still both considered
small clusters, but they should be big enough to use the default replication factor of 3 and
see the performance increase at a fair level of comparison.
APPENDIX B:

How to Set Up a Single Node Hadoop Cluster


This  tutorial  will  guide  you  through  installing  a  single  node  Hadoop  cluster.  This  tutorial 
was  built  using  Tom  White’s  Hadoop:  The  Definitive  Guide  [7]  and  Michael  Noll’s  Hadoop 
Tutorials  [17].  A  single  node  cluster  is  a  great  way  to  start  working  with  the  Hadoop 
Distributed  File  System  and  the  MapReduce  paradigm.  Once  you  have  installed  and  tested 
the  cluster,  all  done  in  this  tutorial,  you  will  have  your  environment  set  up  to  start  testing 
MapReduce  code. 

Notes: 

•  This tutorial assumes that you are installing the node on a Debian-based Linux
platform.

•  This can be done on actual hardware or on a virtual machine.

•  It is recommended that you have at least 4GB of RAM and 100GB of free hard disk
space.

1.  Open up a terminal

2.  Update current packages

(a)  sudo apt-get update

3.  Install  Java 

(a)  sudo  apt-get  install  openjdk-7-jdk 

4.  Verify  the  install 

(a)  java  -version 

5.  Install  nano 

(a)  sudo  apt-get  install  nano 

•  This is a terminal text editor. You may skip this if you choose to use another
terminal text editor (e.g., vi or emacs)

6.  Hadoop  uses  ssh  to  talk  from  the  local  machine  to  the  namenode.  We  have  to  config¬ 
ure  ssh  for  that. 

(a)  ssh-keygen  -t  rsa  -P  "" 
•  save to default file

(b)  if  you  do  not  have  a  ssh  server  install  one 

•  sudo  apt-get  install  openssh-server 

7.  add  the  key  we  just  created  to  the  authorized  key  file 

(a)  cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

8.  test the key

(a)  ssh localhost

•  should  give  you  a  message  that  localhost  has  been  added  to  list  of  known 
hosts 

9.  Hadoop  and  Ubuntu  have  a  conflict  with  IPV6.  The  workaround  we  will  use  is  to 
disable  IPV6 

(a)  sudo nano /etc/sysctl.conf

•  add these lines to the end of the file

-  # disable ipv6

-  net.ipv6.conf.all.disable_ipv6 = 1

-  net.ipv6.conf.default.disable_ipv6 = 1

-  net.ipv6.conf.lo.disable_ipv6 = 1

10.  In  order  for  the  changes  to  take  effect  you  have  to  reboot  the  machine 

(a)  sudo  reboot 

11.  once  the  machine  has  rebooted,  open  a  terminal 

12.  check  to  see  if  the  changes  took  place 

(a)  cat  /proc/sys/net/ipv6/conf/all/disable_ipv6 

•  you  want  to  see  a  return  of  1 

13.  change  directory 

(a)  cd  /usr/local 

14.  Download  Hadoop 

(a)  sudo wget http://download.nextag.com/apache/hadoop/common/hadoop-1.2.1/hadoop-15

15.  extract the tarball

(a)  sudo tar xzf hadoop-1.2.1.tar.gz

16.  move the folder and change permissions

(a)  sudo mv hadoop-1.2.1 hadoop

(b)  sudo chown -R <your username>:<your group> hadoop
17.  update the .bashrc file for the hdfsuser

(a)  sudo nano /home/<your username>/.bashrc

•  add the following lines to the file

-  export HADOOP_HOME=/usr/local/hadoop

-  export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

-  unalias fs &> /dev/null

-  alias fs="hadoop fs"

-  unalias hls &> /dev/null

-  alias hls="fs -ls"

-  lzohead () {

-  hadoop fs -cat $1 | lzop -dc | head -1000 | less

-  }

-  export PATH=$PATH:$HADOOP_HOME/bin

18.  create the directory that Hadoop will use to store files

(a)  sudo mkdir -p /app/hadoop/tmp

(b)  sudo chown <your username>:<your group> /app/hadoop/tmp

19.  Edit configuration files for use in your environment and point to the directory we
just created

(a)  change the hadoop-env.sh file for the Java JDK you installed

•  sudo nano /usr/local/hadoop/conf/hadoop-env.sh

-  uncomment the export JAVA_HOME line and give it the correct path

*  export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

20.  change the core-site.xml file

(a)  sudo nano /usr/local/hadoop/conf/core-site.xml

•  add the following in between the configuration tags

-  <property>

-  <name>hadoop.tmp.dir</name>

-  <value>/app/hadoop/tmp</value>

-  </property>

-  <property>

-  <name>fs.default.name</name>

-  <value>hdfs://localhost:54310</value>

-  </property>

21.  change the mapred-site.xml file

(a)  sudo nano /usr/local/hadoop/conf/mapred-site.xml

•  add the following in between the configuration tags

-  <property>

-  <name>mapred.job.tracker</name>

-  <value>localhost:54311</value>

-  </property>

22.  change the hdfs-site.xml file

(a)  sudo nano /usr/local/hadoop/conf/hdfs-site.xml

•  add the following in between the configuration tags

-  <property>

-  <name>dfs.replication</name>

-  <value>1</value>

-  </property>

23.  Format the HDFS file system and the name node

(a)  /usr/local/hadoop/bin/hadoop namenode -format

•  you should see some output during creation

24.  start your cluster

(a)  cd /usr/local/hadoop

(b)  bin/start-all.sh

25.  run  jps  to  see  that  all  of  the  nodes  started 

(a)  jps 

26.  Download  book  for  wordcount 

(a)  make directory on the local file system

•  mkdir /tmp/book

(b)  cd to that directory

•  cd /tmp/book

(c)  download book

•  wget http://www.textfiles.com/games/abc.txt

(d)  copy files from local file system to hdfs

•  cd /usr/local/hadoop

•  bin/hadoop dfs -copyFromLocal /tmp/book /user/book

(e)  check to see if it is there

•  bin/hadoop dfs -ls /user/

•  bin/hadoop dfs -ls /user/book

27.  run  pre-compiled  wordcount 

(a)  bin/hadoop jar hadoop*examples*.jar wordcount /user/book /user/book-output

28.  make sure the output is there and look at it

(a)  bin/hadoop dfs -ls /user/

(b)  bin/hadoop dfs -ls /user/book-output

(c)  bin/hadoop dfs -cat /user/book-output/part-r-00000

29.  create a local directory and move files there

(a)  mkdir /tmp/book-output

(b)  bin/hadoop dfs -getmerge /user/book-output /tmp/book-output

30.  stop cluster

(a)  bin/stop-all.sh

This completes the tutorial. You are now ready to begin learning how to use the MapReduce
paradigm and writing your own programs. As a place to start, I recommend modifying
the WordCount example on the Hadoop website.